SOME TECHNIQUES IN UNIVERSAL CODING AND
CODING FOR COMPOSITE SOURCES


Mark  Stanley  Wallace,  Ph.D. 

Coordinated  Science  Laboratory  and 
Department  of  Electrical  Engineering 
University of Illinois at Urbana-Champaign, 1982

ABSTRACT 

We consider three problems in source coding. First, we consider the
composite source model. A composite source has a switch driven by a random
process which selects one of a possible set of subsources. We derive some
convergence results for estimation of the switching process, and use these
to prove that the entropy of some composite sources may be computed. Some
coding techniques for composite sources are also presented and their
performance is bounded.

Next, we construct a variable-length-to-fixed-length (VL-FL) universal
code for a class of unifilar Markov sources. A VL-FL code maps strings of
source outputs into fixed-length codewords. We show that the redundancy of
the code converges to zero uniformly over the class of sources as the
blocklength increases. The code is also universal with respect to the
initial state of the source. We compare the performance of this code to
FL-VL universal codes.

We then consider universal coding for real-valued sources. We show
that given some coding technique for a known source, we may construct a code
for any class of sources. We show that this technique works for some classes
of memoryless sources, and also for a compact subset of the class of k-th
order Gaussian autoregressive sources.

INTRODUCTION

The general problem in source coding is that of data compression. The
data which is produced by some information source must be stored or
transmitted. Since there is a cost associated with storage and transmission, it
is of interest to encode the data into as small a number of bits as possible
in order to minimize this cost. If the encoded data is to retain all of
the original information then the problem is one in noiseless source coding.
If there is some allowable distortion then the problem is one in
rate-distortion theory or source coding with a fidelity criterion.

In these problems an information source is modeled as a discrete-time
random process. The source output at each time $i$ is a random variable $X_i$.
The distribution of this random variable (which may depend on previous source
outputs) determines the probability of a given source output. If the source
outputs $(\ldots, X_i, X_{i+1}, \ldots)$ form a stationary random process, then the source
is said to be stationary.

A  code  is  defined  as  a  function  which  maps  blocks  of  source  outputs 
into  binary  strings  which  are  called  codewords.  The  rate  of  a  code  is  the 
expected  number  of  bits  which  are  used  to  encode  a  source  output.  If  a 
source  is  stationary,  then  its  entropy  is  defined.  The  entropy  is  a  lower 
bound  on  the  rate  of  any  noiseless  code,  and  noiseless  codes  exist  with 
rates  which  are  arbitrarily  close  to  the  entropy.  The  difference  between 
the  rate  and  the  entropy  is  called  the  redundancy. 

If the statistics of a source (i.e., the distribution of the source
outputs) are known then a noiseless code for the source may be derived
using Huffman's algorithm [1]. This algorithm gives fixed-length-to-
variable-length (FL-VL) codes, a FL-VL code being one which maps a fixed
number of source outputs into a variable-length binary codeword. The
redundancy of a blocklength $n$ Huffman code is at most $n^{-1}$, so a Huffman
code may be derived with rate as close to the entropy of a source as
desired. A variable-length-to-fixed-length (VL-FL) algorithm (Tunstall's
algorithm) is also known for a given source, and if the blocklength $n$ is
defined as the length of the codewords, then the redundancy of these codes
also decreases as $n^{-1}$.
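
As an illustration of the $n^{-1}$ redundancy bound for Huffman codes, the following is a minimal Python sketch, not taken from the original text. It assumes a memoryless source with known letter probabilities; the function names `huffman_lengths` and `block_redundancy` are hypothetical. It builds a binary Huffman code on blocks of $n$ letters and reports the per-letter redundancy (rate minus entropy, in bits).

```python
import heapq
import itertools
import math


def huffman_lengths(probs):
    """Return the codeword lengths of a binary Huffman code for `probs`."""
    # Heap entries: (probability, tie-breaker, indices of symbols in the subtree).
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    counter = len(probs)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)
        p2, _, s2 = heapq.heappop(heap)
        for s in s1 + s2:
            lengths[s] += 1  # each symbol in the merged subtrees moves one level deeper
        heapq.heappush(heap, (p1 + p2, counter, s1 + s2))
        counter += 1
    return lengths


def block_redundancy(p, n):
    """Per-letter redundancy (bits) of a blocklength-n Huffman code
    for a memoryless source with letter probabilities p (an assumption)."""
    blocks = itertools.product(range(len(p)), repeat=n)
    probs = [math.prod(p[a] for a in block) for block in blocks]
    lengths = huffman_lengths(probs)
    rate = sum(q * l for q, l in zip(probs, lengths)) / n
    entropy = -sum(q * math.log2(q) for q in p if q > 0)
    return rate - entropy
```

For example, `block_redundancy([0.9, 0.1], n)` can be evaluated for increasing `n` to watch the redundancy stay below $1/n$. The enumeration of all blocks grows exponentially in `n`, so this is only practical for small blocklengths.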

In practice the statistics of a source are seldom known exactly so
these encoding algorithms do not apply. Universal source coding considers
this problem. In universal source coding the source statistics are assumed
to lie in some class. The goal is to design a code which performs well
(i.e., one which has a small redundancy) for all of the sources in the
class. A sequence of codes of increasing blocklength is called universal
if the redundancy approaches zero as the blocklength increases for any
source in the class.

There  are  a  number  of  coding  techniques  which  yield  universal  FL-VL 
codes  for  various  classes  of  sources.  Much  less  is  known  about  universal 
VL-FL  codes.  In  Chapter  2  a  universal  VL-FL  coding  technique  for  Markov 
sources  is  derived,  and  its  redundancy  is  bounded. 

A  further  generalization  to  the  source  model  is  to  allow  the  source 
statistics  to  vary  with  time.  So  rather  than  having  a  source  with  fixed, 
but  unknown  statistics,  a  random  process  determines  the  statistics  of  the 
source.  This  random  process,  called  the  switching  process,  together  with 
the  set  of  possible  source  statistics  is  known  as  a  composite  source  [24]. 
Composite sources of various types are considered in a number of papers,
e.g., [2], [3], and [8]. In Chapter 1 we consider composite sources in
which the switching process is a Markov chain, and the possible sources are
memoryless. (The outputs of a memoryless source at two different times
are  independent.)  The  state  of  the  Markov  chain  determines  the  probabilities 
of  the  various  source  outputs,  but  the  state  cannot  (in  general)  be  determined 
by  observing  the  source  outputs.  Some  convergence  properties  for  the  estimate 
of  the  source  statistics  given  the  outputs  are  derived,  and  these  are  used  to 
bound  the  accuracy  of  an  algorithm  to  compute  the  entropy  of  some  composite 
sources.  Some  coding  techniques  for  composite  sources  are  also  presented. 

In  source  coding  with  a  fidelity  criterion  the  rate  of  a  code  is  to  be 
minimized  without  exceeding  some  level  of  distortion.  The  fidelity  criterion 
tells  us  the  distortion  incurred  when  one  source  output  is  reproduced  as 
another output. There are a few possible approaches to coding in this case.
The  outputs  may  be  quantized  individually  into  some  finite  set  of  values 
and  then  encoded  using  a  source  model  such  as  those  used  in  Chapters  1  and  2. 
Another  way  is  to  design  a  code  which  maps  blocks  of  source  outputs  directly 
to  codewords.  This  is  known  as  vector  quantization.  There  are  a  number  of 
techniques  known  for  vector  quantization  under  various  constraints.  In 
Chapter  3  we  show  how  a  technique  of  vector  quantization  for  a  known  source 
may  be  used  to  generate  a  code  for  an  entire  class  of  sources. 

CHAPTER 1

STATE  ESTIMATION  AND  CODING  FOR  COMPOSITE  SOURCES 
1.1  Introduction 

A composite source [24] consists of a set of subsources and a switching
process which selects one of the subsources (see Fig. 1). We consider
discrete-time composite sources with memoryless subsources and a switching
process which is a Markov chain with state space $J = \{1, 2, \ldots, S\}$, $S < \infty$.

Define a state vector $Z(i) = (Z_1(i), \ldots, Z_S(i))$ by $Z_s(i) = 1$ if the
switching process is in state $s$ at time $i$ and $Z_s(i) = 0$ otherwise. The $i$-th
source output is a random variable $X_i$ which takes on values in an alphabet $A$
according to the distribution $\gamma_{Z(i)}(\cdot)$. So the probability of a given source
output is determined by the state of the switching process, and

$$P\{X_i = x \mid Z(i), (X_{i-1}, Z(i-1)), (X_{i-2}, Z(i-2)), \ldots\} = P\{X_i = x \mid Z(i)\} = \gamma_{Z(i)}(x). \tag{1}$$

We refer to $Z(i)$ as the state of the source. The switching process is
specified by an $S \times S$ matrix $Q$ with elements

$$q(s' \mid s) = P\{Z_{s'}(i+1) = 1 \mid Z_s(i) = 1\}.$$

Note that the sequence of states $(Z(0), Z(1), \ldots)$ is not determined by the
outputs $(X_0, X_1, \ldots)$ even if the state $Z(0)$ is known. These sources are not
unifilar Markov sources [1, p. 187].

The composite source has been considered as a model for time-varying
sources [2], [3], and for this application it is generally assumed that the
switching process is slow. We do not assume this; in fact, all of our
results are valid even if the source changes state with high probability
after each source output.
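
To make the model concrete, here is a minimal Python sketch of a composite source simulator. It is an illustration under stated assumptions rather than anything from the original text: states are indexed from 0, `Q[s][t]` is taken to hold $q(t \mid s)$ (row-stochastic layout), the initial state is drawn uniformly, and the function name `simulate_composite_source` is hypothetical.

```python
import random


def simulate_composite_source(Q, gammas, n, seed=0):
    """Generate n (state, output) pairs from a composite source.

    Q[s][t]      : probability of moving from state s to state t (q(t|s)).
    gammas[s][x] : probability that subsource s emits letter x.
    The uniform initial state is an assumption of this sketch.
    """
    rng = random.Random(seed)
    num_states = len(Q)
    state = rng.randrange(num_states)
    states, outputs = [], []
    for _ in range(n):
        letters = range(len(gammas[state]))
        # The output depends only on the current state (memoryless subsources).
        x = rng.choices(letters, weights=gammas[state])[0]
        states.append(state)
        outputs.append(x)
        # The switching process moves according to the Markov chain Q.
        state = rng.choices(range(num_states), weights=Q[state])[0]
    return states, outputs
```

An observer of `outputs` alone cannot in general recover `states`, which is what motivates the estimation problem studied in this chapter.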


Since the state of such a composite source cannot in general be
determined from the outputs, it is of interest to estimate it. Let

$$\hat Z(i) = E[Z(i) \mid X_i, X_{i-1}, \ldots] \tag{2}$$

be the conditional mean estimate of the state given the past outputs.
Since $\hat Z(i)$ is a sufficient statistic for the past outputs, $\hat Z(i+1)$ may be
generated from $X_{i+1}$ and $\hat Z(i)$ using Bayes rule [4]; however, some initial
estimate is required.

The first part of the chapter is concerned with the properties of the
estimation process $\hat Z(i)$. Although the method for generating the estimates
recursively is well known, very little is known about the convergence
properties of such processes. In Section 1.3 we consider the situation
where no initial estimate is available, and prove that the estimates derived
from any two initial estimates will converge. For composite sources with
only two states we show that the recursive computation of the estimates is
stable. That is, small errors which are introduced in any actual implementation
of the estimation procedure do not propagate. This result is not easily
extended to include composite sources with a larger state space. The
mean-square error of the estimate, or more generally the expected value of any
function of $\hat Z(i)$ and $Z(i)$, is determined by the stationary distribution of
the estimation process. However, in general this distribution is not known
to be unique. We show that for two-state sources the estimation process has a
unique stationary distribution, and give an algorithm which may be used to
compute this distribution to any desired accuracy.

Next we turn to source coding problems. The determination of the source's
entropy is of interest as it provides a lower bound on the rate of any
code. If the switching process is stationary then the output process is
also stationary since it is a memoryless function of the switching process.
So the probability of a block $X = (X_1, X_2, \ldots, X_n)$ of source outputs given no
previous outputs may be determined using the stationary distribution of
the switching process as the initial estimate $\hat Z(0)$. Since the source is
stationary we know that its entropy is

$$\lim_{n \to \infty} -n^{-1} \sum_{x \in A^n} P\{X = x\} \log P\{X = x\}. \tag{3}$$

This does not imply that the estimation process has a unique stationary
distribution. As previously mentioned, however, such a distribution exists
if the source has two states, and in this case the entropy is

$$\int H(X_1 \mid \hat Z(0) = z)\, \mu^*(dz)$$

where $\mu^*$ is the stationary distribution. For $k$-state composite sources,
$k > 2$, we do not prove that a unique stationary distribution exists.

We construct fixed-length-to-variable-length (FL-VL) codes for composite
sources and show that their redundancy is bounded by $n^{-1}(\lceil \log S \rceil + 1)$. (A FL-VL
code maps fixed-length blocks of source outputs into variable-length codewords.)
Again propagation of errors is a problem, and so it is not clear whether the
technique is implementable for long blocklengths. For the two-state case
we show that errors do not propagate. In addition the effect of inexact
knowledge of the source parameters (i.e., switching probabilities and subsource
statistics) is bounded. This result is used to construct a universal code for
a class of two-state composite sources.

Finally we construct codes for a special class of composite sources with
an infinite state space. This class of sources has the property that the
probability of switching into any state is independent of the current state.

1.2  Convergence  of  State  Estimates  for  Two-State  Composite  Sources 

Let $\theta$ be a composite source consisting of two memoryless, finite-entropy
subsources with alphabet $A$ and a binary Markov switching process. The state
at time $t$ is $Z_t$, a random variable taking values in $\{0, 1\}$. The transition
probability matrix $Q = \{q(s' \mid s)\}$ of the switching process $\{Z_t\}$ is
specified by two values. For ease of notation let $\alpha = q(1 \mid 0)$ and $\beta = q(0 \mid 1)$,
then $q(0 \mid 0) = 1 - \alpha$ and $q(1 \mid 1) = 1 - \beta$. The composite source $\theta$ is determined
by $\alpha$, $\beta$, and the two subsource distributions $\{\gamma_i(x);\ x \in A\}$, $i = 0, 1$, so we
write $\theta = \{\alpha, \beta, \gamma_0, \gamma_1\}$. Let $\Lambda$ denote the class of such sources for a given
alphabet $A$. Define the estimate

$$\hat Z_t = E[Z_t \mid X_t, X_{t-1}, \ldots]. \tag{4}$$

This  estimate  has  the  following  property. 

Lemma 1.1. Let $X^- \triangleq (X_{-1}, X_{-2}, \ldots)$. Then $X^- \to \hat Z_{-1} \to Z_0$ forms a Markov chain.

Proof: If $z = E[Z_{-1} \mid X^- = x^-]$ then

$$P\{Z_0 = s \mid \hat Z_{-1} = z, X^- = x^-\} = \sum_{s'=0}^{1} P\{Z_0 = s \mid Z_{-1} = s', \hat Z_{-1} = z, X^- = x^-\}\, P\{Z_{-1} = s' \mid \hat Z_{-1} = z, X^- = x^-\}$$

$$= \sum_{s'=0}^{1} P\{Z_0 = s \mid Z_{-1} = s'\}\, P\{Z_{-1} = s' \mid X^- = x^-\} \tag{5}$$

$$= P\{Z_0 = s \mid Z_{-1} = 0\}(1 - z) + P\{Z_0 = s \mid Z_{-1} = 1\}\, z$$

where (5) follows since $\hat Z_{-1}$ is a function of $X^-$ and since the transition
probabilities of the switching process do not depend on the outputs.

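Recall that $\hat Z_t$ is the conditional mean estimate of $Z_t$ given observation
of the source output up to time $t$. Given $\hat Z_t = z$ and $X_{t+1} = x$, Lemma 1.1
implies that $\hat Z_{t+1}$ is given by Bayes rule [4]:

$$\hat Z_{t+1} = f_x(z) \triangleq \frac{P\{X_{t+1} = x, Z_{t+1} = 1 \mid \hat Z_t = z\}}{P\{X_{t+1} = x \mid \hat Z_t = z\}} = \frac{\gamma_1(x)\,\eta_1(z)}{\sum_{i=0}^{1} \gamma_i(x)\,\eta_i(z)} \tag{6}$$

where

$$\eta_i(z) \triangleq P\{Z_{t+1} = i \mid \hat Z_t = z\} = \begin{cases} \beta z + (1 - \alpha)(1 - z); & i = 0 \\ (1 - \beta) z + \alpha (1 - z); & i = 1. \end{cases} \tag{7}$$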
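
The recursion (6)-(7) can be sketched in a few lines of Python. This is an illustrative sketch only: the function names `eta`, `output_prob`, and `update` are hypothetical, letters are assumed to be indexed 0, 1, ..., and `gamma0`, `gamma1` are lists holding the two subsource distributions.

```python
def eta(z, alpha, beta):
    """Predicted state probabilities (eta_0(z), eta_1(z)) of equation (7)."""
    eta1 = (1.0 - beta) * z + alpha * (1.0 - z)
    return 1.0 - eta1, eta1


def output_prob(x, z, alpha, beta, gamma0, gamma1):
    """P{X_{t+1} = x | estimate = z}, i.e. the denominator of (6)."""
    e0, e1 = eta(z, alpha, beta)
    return gamma0[x] * e0 + gamma1[x] * e1


def update(x, z, alpha, beta, gamma0, gamma1):
    """New state estimate f_x(z) of equation (6).

    Assumes the observed letter x has positive probability given z.
    """
    _, e1 = eta(z, alpha, beta)
    return gamma1[x] * e1 / output_prob(x, z, alpha, beta, gamma0, gamma1)
```

Starting from some initial estimate `z`, repeated calls `z = update(x, z, ...)` over the observed letters produce the sequence of estimates. Averaging `update(x, z, ...)` over `x` with weights `output_prob(x, z, ...)` returns $\eta_1(z)$, which is a quick sanity check on an implementation.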


If we define

$$p(x \mid z) \triangleq P\{X_{t+1} = x \mid \hat Z_t = z\} = \sum_{i=0}^{1} \gamma_i(x)\,\eta_i(z), \tag{8}$$

then $p(x \mid z)$ is the probability that the new estimate will be $f_x(z)$ given
that the old estimate was $z$. If $\mu$ is the distribution of $\hat Z_0$ then the
distribution of $\hat Z_1$ is $\mu T$, where $T$ is the measure transformation defined by

$$\mu T(B) = \sum_{x \in A} \int_{f_x^{-1}(B)} p(x \mid z)\, \mu(dz), \tag{9}$$

where $B \subset [0,1]$ and $f_x^{-1}(B) \triangleq \{z \in [0,1] : f_x(z) \in B\}$. The transformation $T$
has the following contraction property. The distance measure used here
is the $\bar\rho$-distance [5] which is defined by

$$\bar\rho(\mu, \nu) = \inf_{\pi \in P} \int |x - y|\, \pi(dx, dy), \tag{10}$$

where $P$ is the set of joint distributions with marginals $\mu$ and $\nu$. We first
prove the following theorem.

Theorem 1.1. (Contraction property of $T$ in the $\bar\rho$-metric)

If $\mu$ and $\nu$ are two distributions of the state estimate for a two-state
composite source with memoryless subsources and if $T$ is the transformation
(9) for this source then

$$\bar\rho(\mu T, \nu T) \le |\lambda|\, \bar\rho(\mu, \nu) \tag{11}$$

where $\lambda \triangleq 1 - \alpha - \beta$ and $\alpha = q(1 \mid 0)$ and $\beta = q(0 \mid 1)$ are transition probabilities
for the switching process.

Proof:  See  Appendix  A. 

The following corollary is an immediate consequence of Theorem 1.1.

Corollary 1.1.

$$\bar\rho(\mu_i, \nu_i) \le |\lambda|^i\, \bar\rho(\mu_0, \nu_0) \tag{12}$$

where $\mu_i = \mu_0 T^i$ and $\nu_i = \nu_0 T^i$. We now show that a unique stationary
distribution exists if $|\lambda| < 1$.

Theorem 1.2. The state estimate $\hat Z_t$ has a unique stationary distribution $\mu^*$
if $|\lambda| < 1$.

Proof: Since the space of possible distributions is compact in the $\bar\rho$-metric
we know that a subsequential limit exists. For any two distributions $\mu_0, \nu_0$

$$\bar\rho(\mu_0, \nu_0) \le 1 \tag{13}$$

so

$$\bar\rho(\mu_i, \nu_i) \le |\lambda|^i. \tag{14}$$

Let $\nu_0 = \mu_j$. Then (14) implies

$$\bar\rho(\mu_i, \mu_{i+j}) \le |\lambda|^i \tag{15}$$

for any $j$. If there exist two subsequential limits $\mu'$ and $\mu''$ with
$\mu_{k_i} \to \mu'$ and $\mu_{j_i} \to \mu''$ for subsequences $k_i$ and $j_i$ then for arbitrary $i$

$$\bar\rho(\mu_{k_i}, \mu_{j_i}) \le |\lambda|^{\min(j_i, k_i)}. \tag{16}$$

It follows that $\bar\rho(\mu', \mu'') = 0$, and thus $\{\mu_i\}$ has a unique limit. Since

$$\mu' = \lim_{i \to \infty} \mu_{i+1} = \lim_{i \to \infty} \mu_i T = \mu' T \tag{17}$$

the limit is stationary.

If the alphabet $A$ is finite then we may compute this stationary distribution
to any desired degree of accuracy as follows. Let $\mu^*$ denote the stationary
distribution. From a distribution $\hat\mu_{j-1}$ concentrated on the set
$\Omega \triangleq \{\frac{2i-1}{2n};\ i = 1, \ldots, n\}$ we generate a distribution $\tilde\mu_j$ on
$\Omega' \triangleq \{f_x(\frac{2i-1}{2n});\ x \in A;\ i = 1, \ldots, n\}$ using the recursive
equation (9). Then a distribution $\hat\mu_j$ concentrated on $\Omega$ is generated using

$$\hat\mu_j\left(\left\{\tfrac{2i-1}{2n}\right\}\right) = \tilde\mu_j\left(\left(\tfrac{i-1}{n}, \tfrac{i}{n}\right]\right), \quad i = 1, \ldots, n. \tag{18}$$

(We use $\{x\}$ to denote the set containing the point $x$.) This algorithm is
clearly implementable since only a bounded number of points is considered.
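
The grid algorithm just described can be sketched as follows. This is a minimal Python illustration, not the implementation referred to in the text: the function name `stationary_estimate_distribution` is hypothetical, the starting distribution is taken to be uniform on the grid (an arbitrary choice), and the mass at $f_x(z) = 0$ is assigned to the first cell.

```python
import math


def stationary_estimate_distribution(alpha, beta, gamma0, gamma1, n, iterations):
    """Approximate the stationary distribution of the state estimate on the
    grid (2i - 1) / (2n), i = 1..n, by iterating (9) and requantizing as in (18)."""
    grid = [(2 * k + 1) / (2.0 * n) for k in range(n)]  # 0-based cell index k
    mu = [1.0 / n] * n                                   # arbitrary starting distribution
    for _ in range(iterations):
        new_mu = [0.0] * n
        for z, mass in zip(grid, mu):
            if mass == 0.0:
                continue
            eta1 = (1.0 - beta) * z + alpha * (1.0 - z)
            for g0, g1 in zip(gamma0, gamma1):
                p = g0 * (1.0 - eta1) + g1 * eta1        # p(x|z) of (8)
                if p == 0.0:
                    continue                             # letter x cannot occur
                f = g1 * eta1 / p                        # f_x(z) of (6)
                # Index of the cell (k/n, (k+1)/n] that contains f.
                k = min(n - 1, max(0, math.ceil(f * n) - 1))
                new_mu[k] += mass * p
        mu = new_mu
    return grid, mu
```

Each iteration costs on the order of `n * len(gamma0)` operations and the storage is proportional to `n`, since only the masses on the grid are kept.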

If we define

$$e_j \triangleq \bar\rho(\hat\mu_j, \mu^*) \tag{19}$$

then

$$e^* \triangleq \lim_{j \to \infty} e_j \le e \triangleq (2n[1 - |\lambda|])^{-1} \tag{20}$$

and

$$e_j \le |\lambda|^j + e. \tag{21}$$

So we may compute $\hat\mu_j$ such that $\bar\rho(\hat\mu_j, \mu^*) \le \varepsilon$ for any $\varepsilon > 0$ by choice
of $n$ and $j$ sufficiently large. Equations (20) and (21) follow since
(18) implies

$$\bar\rho(\tilde\mu_j, \hat\mu_j) \le (2n)^{-1} \tag{22}$$

and from Theorem 1.1

$$\bar\rho(\tilde\mu_j, \mu^*) \le |\lambda|\, \bar\rho(\hat\mu_{j-1}, \mu^*)$$

so

$$e_j \le |\lambda|\, e_{j-1} + (2n)^{-1}. \tag{23}$$

In the limit as $j$ goes to infinity (23) becomes

$$e^* \le |\lambda|\, e^* + (2n)^{-1} \tag{24}$$

which gives (20), and subtracting $e$ from both sides of (23) we have

$$e_j - e \le |\lambda|\,(e_{j-1} - e) \tag{25}$$

which gives (21).

The number of computations required increases linearly with both $n$,
the size of the vector which approximates the joint distribution, and $j$, the
number of iterations. The storage required increases linearly with $n$. If
we fix $j$ according to the limiting error $e$ by

$$|\lambda|^j = e \tag{26}$$

then $j$ is of order $\log n$. So the number of computations required to derive
a distribution within $n^{-1}$ of the true stationary distribution increases as
$n \log n$, and the storage required increases as $n$. This algorithm was
implemented for $A = \{0, 1\}$, i.e. binary memoryless subsources. Two computed
distributions and their associated cumulative distribution functions are
illustrated in Fig. 2. The distributions are concentrated on 1000 points, and
the $\bar\rho$-distance between these distributions and the stationary distributions
is at most .006. The distributions are not smooth, and it does not appear
likely that a closed form analytical description exists.
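
The error recursion (23) and the choice of $j$ in (26) are easy to tabulate. The sketch below is illustrative only; the function names `error_bounds` and `iterations_needed` are hypothetical, and it assumes $0 < |\lambda| < 1$ and a starting error of at most 1, as in (13).

```python
import math


def error_bounds(lam, n, iterations, e0=1.0):
    """Iterate the worst case of (23), e_j = |lam| * e_{j-1} + 1/(2n).

    Returns the limiting error e of (20) and the list [e_1, ..., e_iterations].
    """
    e_limit = 1.0 / (2 * n * (1.0 - abs(lam)))
    bounds, e = [], e0
    for _ in range(iterations):
        e = abs(lam) * e + 1.0 / (2 * n)
        bounds.append(e)
    return e_limit, bounds


def iterations_needed(lam, n):
    """Smallest integer j with |lam|**j <= e, following the choice in (26)."""
    e_limit = 1.0 / (2 * n * (1.0 - abs(lam)))
    return math.ceil(math.log(e_limit) / math.log(abs(lam)))
```

Because `e_limit` shrinks like $1/n$, `iterations_needed` grows like $\log n$, which is the source of the $n \log n$ operation count quoted above.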

The computed distribution may be used to bound the performance of the
estimator as follows. The mean-square estimation error is

$$E[(Z_t - \hat Z_t)^2] = E[Z_t^2] - E[\hat Z_t^2] = E[Z_t] - E[\hat Z_t^2] \tag{27}$$

where (27) follows because $Z_t^2 = Z_t$. If the switching process is stationary
and ergodic then

$$E[Z_t] = \frac{\alpha}{\alpha + \beta}. \tag{28}$$


[Figure 2a. Approximate stationary distribution of the state estimate for
a composite source with $\alpha = \beta = .2$, $\gamma_0(0) = .5$, and $\gamma_1(1) = .9$. Panel (ii): cumulative distribution function.]

If $\mu^*$ is the stationary distribution of the estimate $\hat Z_t$ we have

$$E[\hat Z_t^2] = \int z^2\, \mu^*(dz).$$

Given a distribution $\hat\mu$ such that $\bar\rho(\mu^*, \hat\mu)$ is small, we use the following
theorem to bound

$$\Delta \triangleq \left| \int_0^1 z^2\, d\hat\mu - \int_0^1 z^2\, d\mu^* \right|$$

in terms of $\bar\rho(\mu^*, \hat\mu)$.


Theorem 1.3. If $\mu$ and $\nu$ are probability measures on $[0,1]$ then

$$\left| \int_0^1 f\, d\mu - \int_0^1 f\, d\nu \right| \le \sup_{x \in [0,1]} |f'(x)|\; \bar\rho(\mu, \nu). \tag{31}$$

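Proof. The theorem follows directly from integration by parts. That is

$$\int_0^1 f\, d\mu = f(x)\,\mu[0,x]\Big|_0^1 - \int_0^1 f'(x)\,\mu[0,x]\, dx$$

so

$$\left| \int_0^1 f\, d\mu - \int_0^1 f\, d\nu \right| = \left| \int_0^1 f'(x)\,(\nu[0,x] - \mu[0,x])\, dx \right|$$

$$\le \int_0^1 |f'(x)|\, |\nu[0,x] - \mu[0,x]|\, dx$$

$$\le \sup_{x_0 \in [0,1]} |f'(x_0)|\; \bar\rho(\mu, \nu). \tag{33}$$

Equation (33) follows because for one-dimensional distributions [5]

$$\bar\rho(\mu, \nu) = \int |\mu[0,x] - \nu[0,x]|\, dx.$$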
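
For distributions concentrated on finitely many points of $[0,1]$, the one-dimensional formula for the $\bar\rho$-distance reduces to a finite sum. The following Python sketch illustrates this; the function name `rho_bar` and the representation of a distribution as parallel lists of points and masses are assumptions of the example.

```python
def rho_bar(points_mu, masses_mu, points_nu, masses_nu):
    """Integral over [0, 1] of |mu[0,x] - nu[0,x]| for two discrete
    distributions given as lists of support points and masses."""
    # The CDFs can only jump at support points, so they are constant
    # on each interval between consecutive breakpoints.
    breakpoints = sorted(set(points_mu) | set(points_nu) | {0.0, 1.0})

    def cdf(points, masses, x):
        return sum(m for p, m in zip(points, masses) if p <= x)

    total = 0.0
    for left, right in zip(breakpoints, breakpoints[1:]):
        gap = abs(cdf(points_mu, masses_mu, left) - cdf(points_nu, masses_nu, left))
        total += gap * (right - left)
    return total
```

For two point masses at 0.2 and 0.7 the function returns their separation, 0.5, and the difference of their second moments, 0.45, is indeed below $2 \times 0.5$ as Theorem 1.3 requires for $f(z) = z^2$.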


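So if $f(z) = z^2$ we have

$$\Delta \le 2\bar\rho(\hat\mu, \mu^*),$$

hence the mean-square estimation error may be computed to any desired
accuracy.

Under certain assumptions the entropy of the two-state composite sources
may be computed using the approximate stationary distribution. Lemma 1.1
and (1) imply that

$$X^- \to \hat Z_{-1} \to X_0 \tag{35}$$

is a Markov chain, where $X^- \triangleq (X_{-1}, X_{-2}, \ldots)$. Since $\hat Z_{-1}$ is a function of
$X^-$ it follows from (35) that if $z = E[Z_{-1} \mid X^- = x^-]$ then

$$H(X_0 \mid X^- = x^-) = H(X_0 \mid \hat Z_{-1} = z) = -\sum_{x \in A} p(x \mid z) \log p(x \mid z). \tag{36}$$

Then if $\mu^*$ is the stationary distribution of $\hat Z_{-1}$ the entropy of source $\theta$ is

$$\mathcal{H}(\theta) = \int_0^1 H(X_0 \mid \hat Z_{-1} = z)\, \mu^*(dz). \tag{37}$$

If $\hat\mu$ is the computed distribution, we define

$$\hat{\mathcal{H}}(\theta) = \int_0^1 H(X_0 \mid \hat Z_{-1} = z)\, \hat\mu(dz) = \sum_{i=1}^{n} H(X_0 \mid \hat Z_{-1} = y_i)\, \hat\mu(\{y_i\})$$

where

$$y_i \triangleq n^{-1}(i - \tfrac{1}{2}).$$

We now use Theorem 1.3 to bound

$$\Delta' \triangleq |\mathcal{H}(\theta) - \hat{\mathcal{H}}(\theta)|.$$

By (7) and (8), $p(x \mid z)$ is linear in $z$, so

$$p(x \mid z) \ge \min\{[\gamma_1(x)(1 - \beta) + \gamma_0(x)\beta],\ [\gamma_1(x)\alpha + \gamma_0(x)(1 - \alpha)]\}. \tag{42}$$

If $p(x \mid z) = 0$ for some $x \in A$ and $z \in [0,1]$ and $p'(x \mid z) > 0$ (the prime
denotes the derivative with respect to $z$) then the theorem
does not provide a bound on $\Delta'$. However, if $\alpha, \beta \in (0,1)$
then

$$p(x \mid z) \ge \delta[\gamma_1(x) + \gamma_0(x)] \tag{43}$$

where

$$\delta \triangleq \min\{\alpha, \beta, 1 - \alpha, 1 - \beta\}. \tag{44}$$

So

$$H'(X_0 \mid \hat Z_{-1} = z) \le -\sum_{x \in A} |p'(x \mid z)| \log\{\delta[\gamma_1(x) + \gamma_0(x)]\}$$

$$= -|\lambda| \sum_{x \in A} |\gamma_1(x) - \gamma_0(x)| \log\{\delta[\gamma_1(x) + \gamma_0(x)]\}$$

$$\le 2|\lambda| \log \delta^{-1} + |\lambda|\,[\mathcal{H}(\gamma_0) + \mathcal{H}(\gamma_1)] \tag{45}$$

where

$$\mathcal{H}(\gamma_i) \triangleq -\sum_{x \in A} \gamma_i(x) \log \gamma_i(x).$$

Recall $\mathcal{H}(\gamma_i)$ is assumed to be finite. If we do not have $\alpha, \beta \in (0,1)$ a bound
may still be derived if $\gamma_i(x) \ge \varepsilon > 0$, for all $x \in A$ and $i = 0, 1$. Note
that the alphabet must be finite in this case. Under this assumption
$p(x \mid z) \ge \varepsilon$ so

$$H'(X_0 \mid \hat Z_{-1} = z) \le 2|\lambda| \log \varepsilon^{-1}. \tag{46}$$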

In both of these cases Theorem 1.3 implies

$$\Delta' \le K \bar\rho(\hat\mu, \mu^*) \tag{47}$$

where $K < \infty$ depends only on the parameters of the source, so the entropy
may be computed to any desired accuracy. Note that the complexity of the
computation is the same as that of the computation of the stationary
distribution. This algorithm was implemented and the entropy was computed for
some two-state composite sources with binary alphabets. In Fig. 3a a family
of curves is given. In each curve $\gamma_0(0) = \gamma_1(1)$ is fixed and $\alpha = \beta$ varies
from 0 to .5. The entropy increases to one as the switching probabilities
increase as would be expected. The same curves result if $\alpha$ and $\beta$ are replaced
by $1 - \alpha$ and $1 - \beta$. In Fig. 3b $\gamma_0(1) = .001$ and $\gamma_1(1) = .5$ for all curves.
The ratio $\alpha/(\alpha + \beta)$ is fixed in each curve, and $\beta$ varies from 0 to .5. So
in each curve the proportion of time spent in state 1 is $\alpha/(\alpha + \beta)$. Again
the entropy increases as the switching probabilities increase.
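
The entropy approximation is a weighted sum of conditional entropies over the points carrying the computed distribution. The Python sketch below illustrates that sum; it is not the implementation referred to above. The function names are hypothetical, entropies are measured in bits, and the distribution is passed in as parallel lists `grid` and `masses` (for instance the output of a grid iteration for the stationary distribution).

```python
import math


def conditional_entropy(z, alpha, beta, gamma0, gamma1):
    """H(X_0 | previous estimate = z) in bits, as in equation (36)."""
    eta1 = (1.0 - beta) * z + alpha * (1.0 - z)
    h = 0.0
    for g0, g1 in zip(gamma0, gamma1):
        p = g0 * (1.0 - eta1) + g1 * eta1   # p(x|z) of (8)
        if p > 0.0:
            h -= p * math.log2(p)
    return h


def approximate_entropy(grid, masses, alpha, beta, gamma0, gamma1):
    """Sum of H(X_0 | estimate = y_i) weighted by the mass at each grid point y_i."""
    return sum(m * conditional_entropy(y, alpha, beta, gamma0, gamma1)
               for y, m in zip(grid, masses))
```

Two degenerate cases give quick checks: with identical subsources the result is the entropy of the common letter distribution whatever the weights, and with $\alpha = \beta = .5$ every $z$ yields the entropy of the average of the two subsource distributions.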

The $\bar\rho$-convergence result may also be used to show that estimates which are
derived using different initial estimates of the state converge. Consider
two different initial state estimates, $z_0$ and $\tilde z_0$. If $z_i(x^i)$ and $\tilde z_i(x^i)$ are
the estimates at time $i$ derived using the recursion (6) when $x^i = (x_1, \ldots, x_i)$
is the output of the source, the following theorem shows that these estimates
converge on the average. Define

$$p(x^i \mid z_0) = \prod_{j=0}^{i-1} p(x_{j+1} \mid z_j(x^j)),$$

where $p(x \mid z)$ is from (8). So $p(x^i \mid z_0)$ is the probability that $x^i$ is output
given initial estimate $z_0$.
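
Theorem 1.4. With $z_i(x^i)$ and $\tilde z_i(x^i)$ as above (and $\lambda \triangleq 1 - \alpha - \beta$)

$$r_i(z_0, \tilde z_0) \triangleq \sum_{x^i \in A^i} |z_i(x^i) - \tilde z_i(x^i)|\, p(x^i \mid z_0) \le |\lambda|^i\, |z_0 - \tilde z_0|. \tag{48}$$

Proof. See Appendix A.

Corollary 1.2. If $\pi$ is any joint distribution of $z_0$ and $\tilde z_0$ then

$$\int r_i(z_0, \tilde z_0)\, d\pi \le |\lambda|^i \tag{50}$$

since $|z_0 - \tilde z_0| \le 1$ for all $z_0$ and $\tilde z_0$.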

Theorem 1.4 implies that the recursive computation of the state estimate
is stable. That is, suppose that some error $e_i$, where $|e_i| \le \varepsilon$, is introduced
in the computation of the $i$-th estimate. Then if the initial estimate in the
computation differs by some $e_0$ from the actual initial estimate (i.e., the
estimate derived from observations of all past source outputs), the average
error after $i$ steps is bounded by

$$\sum_{x^i \in A^i} |z_i(x^i) - \tilde z_i(x^i)|\, p(x^i \mid z_0) \le |\lambda|^i |e_0| + \varepsilon (1 - |\lambda|)^{-1}. \tag{51}$$

Here $\tilde z_i(x^i)$ includes the computational errors $e_i$.
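
The averaged convergence statement (48) can be checked numerically for short output strings by enumerating every string. The Python sketch below does this for the error-free recursion; it is an illustration only, the function names `bayes_step` and `average_gap` are hypothetical, and all subsource probabilities are assumed positive so that no division by zero occurs.

```python
import itertools


def bayes_step(x, z, alpha, beta, gamma0, gamma1):
    """Return (f_x(z), p(x|z)) for the two-state recursion (6)-(8)."""
    eta1 = (1.0 - beta) * z + alpha * (1.0 - z)
    p = gamma0[x] * (1.0 - eta1) + gamma1[x] * eta1
    return gamma1[x] * eta1 / p, p


def average_gap(z0, z0_alt, i, alpha, beta, gamma0, gamma1):
    """Sum over all strings x^i of |z_i(x^i) - z~_i(x^i)| * p(x^i | z0)."""
    total = 0.0
    for xs in itertools.product(range(len(gamma0)), repeat=i):
        z, z_alt, prob = z0, z0_alt, 1.0
        for x in xs:
            z_next, p = bayes_step(x, z, alpha, beta, gamma0, gamma1)
            z_alt, _ = bayes_step(x, z_alt, alpha, beta, gamma0, gamma1)
            prob *= p          # probability is accumulated along the z0 trajectory
            z = z_next
        total += abs(z - z_alt) * prob
    return total
```

With $\alpha = .2$ and $\beta = .3$, so that $\lambda = .5$, the value of `average_gap(0.1, 0.9, i, ...)` can be compared against $0.5^i \times 0.8$ for small `i`. The enumeration is exponential in `i`, so it serves only as a spot check.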

Now suppose that the parameters of the composite source are not known
precisely. That is, suppose that the source is $\theta = \{\alpha, \beta, \gamma_0, \gamma_1\}$ and we use
the parameters for another source $\varphi = \{\alpha', \beta', \gamma_0', \gamma_1'\}$ in the recursion (6).
Under the assumption that the parameters for $\theta$ and $\varphi$ are within $\varepsilon$ the
average error in the estimate derived using the parameters for $\varphi$ is of
order $\varepsilon$. Here we must assume that $\theta, \varphi \in \Lambda'(\delta)$ for some $\delta > 0$ where

$$\Lambda'(\delta) \triangleq \{\theta \in \Lambda : p(x \mid z) \ge \delta,\ \forall\, x \in A,\ z \in [0,1]\}. \tag{52}$$

This condition is satisfied if, for example, $\gamma_i(x) \ge \delta > 0$, for all $x \in A$
and $i = 0, 1$. Again this implies that the alphabet $A$ is finite. We also
include computational errors $e_i$, $|e_i| \le \varepsilon$. Let $z_0$ and $\tilde z_0$ be two initial
estimates of the state $Z_0$. Further let $z_i(x^i)$ and $\tilde z_i(x^i)$ be the estimates
derived from these initial estimates using the recursions for $\theta$ and $\varphi$
respectively. Note that now these estimates are derived from different
recursions and that $\tilde z_i(x^i)$ includes computational errors so

$$\tilde z_i(x^i) = \tilde f_{x_i}(\tilde z_{i-1}(x^{i-1})) + e_i \tag{53}$$

where $\tilde f$ is defined as $f$, (6), (7), but with the parameters for $\varphi$. Then the
following is true.

Theorem 1.5. If $\theta, \varphi \in \Lambda'(\delta)$ then

VV'O*  “  Z  l8ifel> -«ife1)|iP<51|Bo) 

x 

*  i^q i 1  +  Kcci-igrl  (54) 

where 

\  ^  6"2[3t  +3c2  +  e3]  +e, 

$p(x^i \mid z_0)$ is the probability that $x^i$ is output from source $\theta$ if the initial
estimate is $z_0$, and $\lambda_0 \triangleq 1 - \alpha - \beta$.

Proof. See Appendix A.

So the estimation procedure is robust; that is, small errors in source
parameters do not cause unbounded errors in the estimates.

1.3 Convergence of State Estimates for S-State Composite Sources

Now consider the more general case where the Markov chain has state
space $J = \{1, 2, \ldots, S\}$ and selects one of $S$ subsources which are discrete
memoryless sources with alphabet $A$. Let $\gamma_s(x)$ be the probability
that a letter $x \in A$ is output given that the Markov chain is in state $s \in J$.
Let $\Lambda$ denote the class of such sources for a given $A$ and $J$. Define a
state (row) vector $Z(i) = (Z_1(i), Z_2(i), \ldots, Z_S(i))$ by $Z_s(i) = 1$ if the chain
is in state $s$ at time $i$ and $Z_s(i) = 0$ otherwise. Let $Q = \{q(i \mid j)\}$ be the
state transition matrix. We define

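$$\hat Z(i) = E[Z(i) \mid X_i, X_{i-1}, \ldots] \tag{55}$$

where $X_i$ is the output at time $i$. So $\hat Z(i)$ is the conditional mean estimate of
$Z(i)$ given the outputs up to time $i$. A recursive equation for $\hat Z$ is [6],

$$\hat Z(i+1) = \hat Z(i)\, T(x)\, [\hat Z(i)\, T(x)\, \mathbf{1}]^{-1} \tag{56}$$

where $X_{i+1} = x$ is the source output,

$$T(x) \triangleq Q P(x), \tag{57}$$

$$P(x) = \begin{pmatrix} \gamma_1(x) & & & \\ & \gamma_2(x) & & \\ & & \ddots & \\ & & & \gamma_S(x) \end{pmatrix} \tag{58}$$

is a diagonal matrix and $\mathbf{1}$ is a column vector of 1's. The probability that
source output $X_{i+1} = x$ given $\hat Z(i) = z$ is

$$p(x \mid z) \triangleq z\, T(x)\, \mathbf{1}. \tag{59}$$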
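
One step of the matrix recursion (56)-(59) can be written with plain Python lists. The sketch below is illustrative; the function names `t_matrix` and `step` are hypothetical, states and letters are indexed from 0, and `Q[s][t]` is assumed to hold the probability of moving from state `s` to state `t`, so that a row vector times `Q` is the predicted state distribution.

```python
def t_matrix(Q, gammas, x):
    """T(x) = Q P(x), where P(x) = diag(gamma_1(x), ..., gamma_S(x))."""
    size = len(Q)
    return [[Q[s][t] * gammas[t][x] for t in range(size)] for s in range(size)]


def step(z, Q, gammas, x):
    """One application of (56).

    Returns (new estimate, p(x|z)); assumes p(x|z) > 0.
    """
    size = len(Q)
    T = t_matrix(Q, gammas, x)
    unnormalized = [sum(z[s] * T[s][t] for s in range(size)) for t in range(size)]
    p = sum(unnormalized)                    # z T(x) 1, equation (59)
    return [u / p for u in unnormalized], p
```

For $S = 2$ with `Q = [[1 - alpha, alpha], [beta, 1 - beta]]` and `z = [1 - z1, z1]`, the second component of the result coincides with the two-state update $f_x(z_1)$ of Section 1.2.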

Let $x = (x_1, \ldots, x_n)$, $x_i \in A$, and define a matrix $T(x)$ by

$$T(x) = \prod_{i=1}^{n} T(x_i). \tag{60}$$

Then if $\hat Z(0) = z$ and $x$ consists of the first $n$ outputs, the $n$-th state
estimate is

$$\hat Z(n) = z\, T(x)\, [z\, T(x)\, \mathbf{1}]^{-1} \tag{61}$$

and the probability of $x$ given $z$ is

$$p(x \mid z) = z\, T(x)\, \mathbf{1}. \tag{62}$$

A source $\theta \in \Lambda$ is specified by $Q$ and $\{P(x) : x \in A\}$ so we write $\theta = \{Q, P(\cdot)\}$.
Let $\mathbb{P}(J)$ denote the set of probability distributions on $J$. We now show
that under certain conditions if $z$ and $\tilde z$ are in $\mathbb{P}(J)$, then the estimates
generated from (61) converge. Define

$$\Lambda(\varepsilon) = \{\theta \in \Lambda : q(i \mid j) \ge \varepsilon > 0,\ i, j \in J\}. \tag{63}$$


Then  we  have  the  following  theorem. 

Theorem 1.6. Let $\theta \in \Lambda(\varepsilon)$. If $z$ and $\tilde z$ are probability vectors on $J$ such
that $p(x \mid z) > 0$ and $p(x \mid \tilde z) > 0$, $x = (x_1, \ldots, x_n)$, then

II*  T  (*)[P(*|*)]"1  -  |T  (x)(p(x|i)rl||  <  Qaml  (64) 

where 

2  2 

a  m  . 

(65) 


r  ft 

4  2  2 

(1-  (S-l)*r+c 


and $\|\cdot\|$ is the norm defined by $\|u - v\| = \max\{|u_i - v_i| : 1 \le i \le n\}$.
Proof. See Appendix A.

The rate of convergence here is not as fast as that of the average
convergence result for the two-state composite source, but the convergence
bound holds for any sequence of outputs and not merely on average. The
restriction that we must have $p(x \mid z)$ and $p(x \mid \tilde z)$ positive is of no real
importance, since if $p(x \mid z)$ is zero, this means that the estimate $z$ is
incorrect so we may choose a new initial estimate $z'$ such that $p(x \mid z') > 0$.
A more important drawback here is that the theorem does not imply that the
estimates converge at each step (in fact they do not in general), but only
that after $n$ steps they are within the bound given in (64). The theorem does
not imply that a computed estimate remains close to the true estimate despite
small computational errors at each step.
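
The sequence-by-sequence convergence discussed here can be observed numerically by running the recursion (61) from two different initial vectors on the same output string. The Python sketch below is an illustration only and does not evaluate the bound of Theorem 1.6; the function names `run` and `max_gap` are hypothetical, and `Q[s][t]` is again assumed to be the probability of moving from state `s` to state `t`.

```python
def run(z, Q, gammas, xs):
    """Apply the normalized recursion (56) along the output string xs."""
    size = len(Q)
    for x in xs:
        # Predict with Q, weight by the likelihood of x in each state, normalize.
        u = [gammas[t][x] * sum(z[s] * Q[s][t] for s in range(size))
             for t in range(size)]
        total = sum(u)
        z = [v / total for v in u]
    return z


def max_gap(z_a, z_b, Q, gammas, xs):
    """Max-norm distance between the estimates started from z_a and z_b."""
    za, zb = run(z_a, Q, gammas, xs), run(z_b, Q, gammas, xs)
    return max(abs(a - b) for a, b in zip(za, zb))
```

With all entries of `Q` bounded away from zero, `max_gap` evaluated on longer and longer prefixes of one fixed string shrinks toward zero, although not necessarily monotonically from one step to the next.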

Theorem 1.6 also applies in the more general case where the transition
matrix $Q$ depends on the current output $x$. So we have a family of matrices
$\{Q_x : x \in A\}$. If we assume that the elements of $Q_x$ are at least $\varepsilon$ for all
$x \in A$, then the theorem holds. The only change necessary in the proof is
that $Q$ is replaced by $Q_x$.

1.4  Generalization  to  Arbitrary  Subsources 

Some of the estimation results also hold for memoryless sources (not necessarily finite entropy) having an arbitrary alphabet $A$. Consider first the two-state composite source. Where previously we assumed that the alphabet $A$ was countable and that the subsources had finite entropy, here we assume that the sources are specified by two probability measures $P_0$ and $P_1$ on an alphabet $A$. If we define $\pi = \frac{1}{2}(P_0 + P_1)$ then the Radon-Nikodym derivative $\frac{dP_i}{d\pi}$ exists, $i = 0, 1$. Then given the $i$-th state estimate $Z_i = z$ and the $(i+1)$-st source output $X_{i+1} = x$ we have

$$Z_{i+1} = F_x(z) = \frac{\frac{dP_1}{d\pi}(x)\, v_1(z)}{\sum_{i=0}^{1} \frac{dP_i}{d\pi}(x)\, v_i(z)} \tag{66}$$


which is Bayes rule for this case. Define

$$p(x \mid z) = \sum_{i=0}^{1} \frac{dP_i}{d\pi}(x)\, v_i(z) \tag{67}$$

so that

$$P\{X_{i+1} \in B \mid Z_i = z\} = \int_B p(x \mid z)\, \pi(dx). \tag{68}$$


Then if $\mu_0$ is the distribution of $Z_0$, the distribution $\mu_1$ of $Z_1$ is given by

$$\mu_1(B) = \int_A \int_{F_x^{-1}(B)} p(x \mid z)\, \mu_0(dz)\, \pi(dx). \tag{69}$$

If we use the recursion (69) in place of (9) then Theorem 1.1 holds for these generalized subsources. The only modifications necessary to the proof of Theorem 1.1 are to replace $\gamma_i(x)$ by $\frac{dP_i}{d\pi}(x)$, $i = 0, 1$, and to replace all summations over the alphabet $A$ by integration with respect to the measure $\pi$. Corollary 1.1 and Theorem 1.2 follow directly from Theorem 1.1, so we know that the state estimate $Z_i$ has a unique stationary distribution. However, the computation of an approximation to this stationary distribution may not be performed as it was in Section 1.2 because the alphabet is not finite.

The average convergence of Theorem 1.4 also holds, if we modify the proof in the same way as the proof of Theorem 1.1. Theorem 1.5 is not easily generalized though, as it was necessary to assume finite alphabet size.

The convergence result (Theorem 1.6) for S-state composite sources also generalizes. Let $P_i$, $i = 1, 2, \ldots, S$, be probability measures on $A$ for the subsources. Then if we define $\pi = S^{-1} \sum_{i=1}^{S} P_i$, the Radon-Nikodym derivative $\frac{dP_i}{d\pi}$ exists, $i = 1, 2, \ldots, S$. If we replace $\gamma_i(x)$ by $\frac{dP_i}{d\pi}(x)$ in the definition of $P(x)$ (58) then Theorem 1.6 holds, and the same proof is valid.

1.5  A  Coding  Technique  for  Composite  Sources 

Let $\theta$ be a composite source as in the previous section. The switching process has state space $\mathcal{S} = \{1, 2, \ldots, S\}$ and each subsource is a discrete memoryless source with alphabet $A$. If the state of the switching process is $s$ then the probability of the source output $x$ is $\gamma_s(x)$, independently of previous states and source outputs. Let $\mathbb{P}(\mathcal{S})$ be the set of probability distributions on $\mathcal{S}$ and define $e^j \in \mathbb{P}(\mathcal{S})$ to be the probability (row) vector whose j-th element is one. If the switching process is in state $j$ at time $t$ we define the state $Z(t) = e^j$. The transition probability matrix for the switching process will depend on the current state and the current source output. So we define

$$q_x(i \mid j) = P\{Z(t+1) = e^i \mid Z(t) = e^j,\ X(t) = x\} \tag{70}$$

and

$$Q_x = \{q_x(i \mid j) : i, j \in \mathcal{S}\}. \tag{71}$$

We do not require the elements of $Q_x$ to be bounded below by some $\epsilon > 0$ (as was the case in the previous section). Note that this class includes unifilar Markov sources, that is, sources where the next state is a deterministic function of the current state and source output. For these sources the elements of the matrices $Q_x$, $x \in A$, are either zero or one.

We now construct a variable-rate code for a given composite source and bound its redundancy uniformly over all initial state estimates. The codes considered here are fixed-length-to-variable-length (FL-VL) codes, so they encode fixed-length blocks of source outputs into variable-length binary codewords. The blocklength of a FL-VL code is the number of source letters encoded in a block. The n-th order entropy of source $\theta$ given initial state estimate $\hat{Z}(0) = z$ is given by

$$H_n(\theta, z) = -n^{-1} \sum_{\mathbf{x} \in A^n} p(\mathbf{x} \mid z) \log p(\mathbf{x} \mid z) \tag{72}$$

where

$$p(\mathbf{x} \mid z) \triangleq z\,T(\mathbf{x})\,\mathbf{1} \tag{73}$$

and $T$ is as defined in (60). So $H_n(\theta, z)$ is a lower bound on the rate of any blocklength n code for source $\theta$ and initial state estimate $z$. Let $\ell_n(\mathbf{x})$ be the length of the binary codeword for the output block $\mathbf{x} \in A^n$. Then the rate of the code is

$$R_n(\theta, z) \triangleq n^{-1} \sum_{\mathbf{x} \in A^n} p(\mathbf{x} \mid z)\, \ell_n(\mathbf{x}) \tag{74}$$

and the redundancy is

$$r_n(\theta, z) \triangleq R_n(\theta, z) - H_n(\theta, z). \tag{75}$$

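If we let $z = (z_1, \ldots, z_S)$ then

$$p(\mathbf{x} \mid z) = z\,T(\mathbf{x})\,\mathbf{1} \tag{76}$$

$$= \sum_{i=1}^{S} z_i\, p(\mathbf{x} \mid e^i) \tag{77}$$

$$\le \max\{p(\mathbf{x} \mid e^i) : i \in \mathcal{S}\}. \tag{78}$$

So to design a code for $\theta$ and $\hat{Z}(0) = z$ we first design codes for initial state estimates $e^i$, $i \in \mathcal{S}$, and combine these S codes into a single code by prefixing each codeword with $\lceil \log S \rceil$ bits. The code for initial state $e^i$ is the Shannon code for probabilities $p(\mathbf{x} \mid e^i)$, so the length of the codeword for $\mathbf{x}$ is

$$\ell_n^i(\mathbf{x}) = \lceil -\log p(\mathbf{x} \mid e^i) \rceil \tag{79}$$

$$\le 1 - \log p(\mathbf{x} \mid e^i). \tag{80}$$

The codeword for $\mathbf{x}$ in the combined code is then the shortest of the S possible codewords, so the length function of the combined code is

$$\ell_n(\mathbf{x}) = \min\{\ell_n^i(\mathbf{x}) : i \in \mathcal{S}\} + \lceil \log S \rceil \tag{81}$$

$$< -\log[\max\{p(\mathbf{x} \mid e^i) : i \in \mathcal{S}\}] + 1 + \lceil \log S \rceil \tag{82}$$

$$\le -\log[p(\mathbf{x} \mid z)] + 1 + \lceil \log S \rceil, \quad \forall\, z \in \mathbb{P}(\mathcal{S}). \tag{83}$$

The rate of the code when applied to $\theta$ with $\hat{Z}(0) = z$ is

$$R_n(\theta, z) \le n^{-1}\Big\{1 + \lceil \log S \rceil - \sum_{\mathbf{x} \in A^n} p(\mathbf{x} \mid z) \log p(\mathbf{x} \mid z)\Big\} \tag{84}$$

$$= H_n(\theta, z) + n^{-1}[1 + \lceil \log S \rceil], \tag{85}$$

and so its redundancy is bounded by

$$r_n(\theta, z) \le n^{-1}[1 + \lceil \log S \rceil], \tag{86}$$

for all $z \in \mathbb{P}(\mathcal{S})$.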
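
The construction (79)-(86) can be checked numerically on a small example. The Python sketch below is illustrative only. It assumes that block probabilities are obtained by a filtering recursion (a Bayes update on each output followed by one step of the switching process), takes logarithms to base 2, and uses made-up parameters `Q` and `gamma` for a two-state binary source whose switching matrix does not depend on the output.

```python
import itertools
import math


def block_prob(z, xs, Q, gamma):
    """Probability of the block xs given initial state estimate z."""
    S = len(z)
    prob = 1.0
    for x in xs:
        w = [z[s] * gamma[s][x] for s in range(S)]
        p = sum(w)                      # probability of x given the past outputs
        prob *= p
        post = [v / p for v in w]
        z = [sum(post[j] * Q[j][i] for j in range(S)) for i in range(S)]
    return prob


def combined_lengths(n, Q, gamma):
    """Length function (81): shortest Shannon codeword plus a ceil(log S)-bit prefix."""
    S = len(Q)
    prefix = math.ceil(math.log2(S))
    units = [[1.0 if i == j else 0.0 for j in range(S)] for i in range(S)]
    lengths = {}
    for xs in itertools.product(range(len(gamma[0])), repeat=n):
        per_state = [math.ceil(-math.log2(block_prob(e, xs, Q, gamma))) for e in units]
        lengths[xs] = min(per_state) + prefix
    return lengths


def redundancy(z, n, Q, gamma, lengths):
    """Per-letter redundancy (75) of the code for initial estimate z."""
    total = 0.0
    for xs, length in lengths.items():
        p = block_prob(z, xs, Q, gamma)
        total += p * (length + math.log2(p))
    return total / n


Q = [[0.8, 0.2], [0.3, 0.7]]          # hypothetical switching probabilities
gamma = [[0.9, 0.1], [0.1, 0.9]]      # hypothetical subsource statistics
n = 6
lengths = combined_lengths(n, Q, gamma)
bound = (1 + math.ceil(math.log2(len(Q)))) / n   # right-hand side of (86)
for z in ([1.0, 0.0], [0.5, 0.5], [0.2, 0.8]):
    print(z, round(redundancy(z, n, Q, gamma, lengths), 4), "<=", round(bound, 4))
```

The same table of lengths is used for every initial estimate, which is the sense in which the bound is uniform in $z$.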
One problem with this coding technique involves the propagation of errors. To determine the codeword lengths the probabilities of the codewords must be determined. This requires n matrix multiplications, and there is no guarantee that errors will not propagate. Theorem 1.6 implies that the effect of an error on one step will decrease exponentially, but does not imply that the effect of small errors made in each step will remain small. Propagation of errors is not a problem when coding for a unifilar Markov source with finite state space and alphabet. For such a source the probability of a source vector $\mathbf{x}$ given initial state $s_0$ is

$$P(\mathbf{X} = \mathbf{x} \mid s_0) = \prod_{i=1}^{n} p(x_i \mid s_{i-1}) \tag{87}$$

$$= \prod_{x \in A} \prod_{s \in \mathcal{S}} p(x \mid s)^{N[(x,s),(\mathbf{x},s_0)]} \tag{88}$$

where $N[(x,s),(\mathbf{x},s_0)]$ is the number of times in the block $\mathbf{x}$ that the letter $x$ occurs when the source is in state $s$, given that the initial state is $s_0$. The product (88) may be computed using at most $|A| \cdot |\mathcal{S}|$ multiplications for any n, so the effect of computational errors need not increase as n becomes large. Some further convergence result is required to show that the code for composite sources is implementable, although in view of the convergence result of Theorem 1.6 it is probable that the computation is stable.
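
For a unifilar source the two forms (87) and (88) are easy to compare directly. The Python sketch below is illustrative only; the two-state binary source, its next-state table `f`, and its letter probabilities `p` are made-up values, and the function names are hypothetical.

```python
from collections import Counter


def prob_sequential(xs, s0, p, f):
    """Form (87): multiply p(x_i | s_{i-1}) along the state path."""
    prob, s = 1.0, s0
    for x in xs:
        prob *= p[s][x]
        s = f[s][x]          # unifilar: the next state is determined by (state, letter)
    return prob


def prob_from_counts(xs, s0, p, f):
    """Form (88): count each (letter, state) pair, then raise p(x|s) to that count."""
    counts, s = Counter(), s0
    for x in xs:
        counts[(x, s)] += 1
        s = f[s][x]
    prob = 1.0
    for (x, s), k in counts.items():
        prob *= p[s][x] ** k  # at most |A||S| factors, whatever the blocklength
    return prob


p = [[0.9, 0.1], [0.4, 0.6]]   # p[s][x]: probability of letter x in state s
f = [[0, 1], [0, 1]]           # f[s][x]: next state (here, the last letter emitted)
xs = [0, 0, 1, 1, 0, 1, 0, 0, 0, 1]
print(prob_sequential(xs, 0, p, f), prob_from_counts(xs, 0, p, f))
```

Only the counts grow with the blocklength; the number of factors in the second form stays fixed, which is why rounding errors need not accumulate with n.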
Ott [7] considers the same coding problem but assumes that the encoder and decoder have the initial state estimate for the source. The code he constructs is simply the Huffman code for the source given a specific initial state estimate (and is optimal for that estimate), but it is not universal with respect to the initial state estimate. He does not prove any convergence results which would indicate that the computation is stable.

One modification which improves the code performance is as follows. Since only $2^n$ codewords of the $S \times 2^n$ possible codewords are used, the additional codewords may be removed and the remaining ones shortened. This technique is employed in Section 4 of [11]. Define


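$$p^*(\mathbf{x}) = \frac{2^{-\ell_n(\mathbf{x})}}{\sum_{\mathbf{x}' \in A^n} 2^{-\ell_n(\mathbf{x}')}} \tag{89}$$

$$\ell_n^*(\mathbf{x}) = \lceil -\log p^*(\mathbf{x}) \rceil \tag{90}$$

so the Shannon code for $p^*$ performs at least as well as $\ell_n$.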

The performance of codes with blocklengths n = 5, 8, and 10 which incorporate this modification is presented in Fig. 4. The sources for which the codes are designed have $\gamma_0(0) = \gamma_1(1) = .9$ and $\alpha = \beta$ between 0 and .5. Each curve gives the performance of a set of codes of the same blocklength. The rates and source entropy are given in Fig. 4a, and the redundancies in Fig. 4b.


b) Redundancy of codes with blocklengths n = 5, 8, and 10

Figure 4. Performance of the coding technique for two-state binary composite sources with $\gamma_0(0) = \gamma_1(1) = 0.9$ and $\alpha = \beta$ in [0, .5]


1.6 Stability of Coding and Universal Coding for Composite Sources

If the composite source consists of only two subsources as in Section 1.2 we may show that the effect of computational errors on the redundancy may be made small. This result is implied by the following theorem which bounds the mismatch redundancy; that is, the redundancy which results when a code designed for a source $\theta'$ is applied to another source $\theta$. The theorem includes the effect of computational errors. We return to the notation of Section 1.2. Let $\theta = \{\alpha, \beta, \gamma_0, \gamma_1\}$ and $\theta' = \{\alpha', \beta', \gamma_0', \gamma_1'\}$ be two composite sources (recall that $\alpha = q(1 \mid 0)$ and $\beta = q(0 \mid 1)$ are the transition probabilities for the switching process). We assume that $\theta, \theta' \in \Lambda'(\delta)$ where

$$\Lambda'(\delta) \triangleq \{\theta \in \Lambda : p(x \mid z) \ge \delta,\ x \in A,\ z \in [0,1]\}. \tag{91}$$

Let $p_\theta(x^n \mid z_0)$ be the probability that $x^n = (x_1, \ldots, x_n)$ is the output of source $\theta$ given initial estimate $z_0$, and similarly for $p_{\theta'}(x^n \mid z_0)$. Let $\hat{z}_i(x^i)$ be the estimate of the state used in designing the code. So $\hat{z}_i(x^i)$ includes computational errors $e_i$ as in Theorem 1.3 and we again assume that
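$|e_i| < \epsilon$. Then if the initial estimate is $z_0$ the mismatch redundancy is

$$r_n(\ell_n, \theta) = n^{-1}\Big\{1 + \sum_{x^n \in A^n} p_\theta(x^n \mid z_0)\Big[\min_{k=0,1} \lceil -\log p_{\theta'}(x^n \mid k) \rceil + \log p_\theta(x^n \mid z_0)\Big]\Big\}. \tag{92}$$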


Theorem 1.7. If $\theta, \theta' \in \Lambda'(\delta)$ and corresponding parameters (i.e., switching probabilities and subsource statistics) for sources $\theta$ and $\theta'$ are within $\epsilon$, then if $\ell_n$ is the code designed for source $\theta'$ we have

$$r_n(\ell_n, \theta) < K n^{-1} + K' \epsilon \tag{93}$$

where

$$K \triangleq 2 + \delta^{-1} \log e\, [1 - |1 - \alpha - \beta|]^{-1} \tag{94}$$

and

$$K' = \delta^{-1} \log e\, [1 - |1 - \alpha - \beta|]^{-1} \{\delta^{-2}[3\epsilon + 3\epsilon^2 + \epsilon^3] + 5\epsilon + 2\epsilon^2\}. \tag{95}$$

Proof. See Appendix A.

We may use this mismatch result to construct a sequence of minimax universal codes for any subset $\tilde{\Lambda}$ of $\Lambda'(\delta)$. A sequence of codes $\{\ell_n^* : n = 1, 2, \ldots\}$ is said to be minimax universal for a class of sources $\tilde{\Lambda}$ if the redundancy

$$r_n(\ell_n^*, \theta) \to 0 \tag{96}$$

uniformly on $\tilde{\Lambda}$ as $n \to \infty$. We construct the code as follows. The alphabet $A$ is assumed finite so let $A = \{1, 2, \ldots, J\}$. Let $i$, $j$ and $K(m,x)$; $m = 0, 1$, $x = 1, 2, \ldots, J-1$ be nonnegative integers less than n. Define a set

$$B_n(i, j, K(m,x)) = \{\theta \in \tilde{\Lambda} : \alpha \in [i n^{-1}, (i+1) n^{-1}],\ \beta \in [j n^{-1}, (j+1) n^{-1}],\ \gamma_m(x) \in [K(m,x) n^{-1}, (K(m,x)+1) n^{-1}]\}. \tag{97}$$

Note that $B_n$ has dimension $2 + 2(J-1) = 2J$ since each subsource is specified by $J-1$ parameters. From each non-empty set $B_n$ choose an element $\varphi$ called the design point source. The number of design point sources is bounded by $n^{2J}$, since there are at most this number of sets $B_n$. A Shannon code $\ell_{n,\varphi}$ is then constructed for each of the design point sources as in Section 1.5.

A prefix of length $\lceil 2J \log n \rceil$ which identifies $\varphi$ is attached to the codewords in the code $\ell_{n,\varphi}$. The universal code is then constructed by combining these codes. The universal code is uniquely decodable since the prefix specifies $\varphi$ and since the codes $\ell_{n,\varphi}$ are uniquely decodable. The encoding procedure is simply to choose the shortest codeword of the $n^{2J}$ possible codewords for a given output block $\mathbf{x}$, so the length function for the universal code is

$$\ell_n^*(\mathbf{x}) = \lceil 2J \log n \rceil + \min_{\varphi}\{\ell_{n,\varphi}(\mathbf{x})\}. \tag{98}$$

For any $\theta \in \tilde{\Lambda}$ there exists a design point source $\varphi$ whose parameters are within $n^{-1}$ of the parameters of $\theta$. Let $\varphi$ be this design point source for $\theta$. Then we have

$$r_n(\ell_n^*, \theta) \le n^{-1} \lceil 2J \log n \rceil + r_n(\ell_{n,\varphi}, \theta). \tag{99}$$

Since $r_n(\ell_{n,\varphi}, \theta)$ is the mismatch redundancy of Theorem 1.7 with $\epsilon = n^{-1}$ we have

$$r_n(\ell_n^*, \theta) \le n^{-1}\{\lceil 2J \log n \rceil + K + K'\} \tag{100}$$

for all $\theta \in \tilde{\Lambda}$, and the sequence of codes $\ell_n^*$ is minimax universal.
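
To get a feel for the prefix term in (100), the short Python sketch below tabulates the per-letter cost of identifying the design point source for a binary alphabet (J = 2). It is illustrative only: logarithms are taken to base 2, the function names are hypothetical, and the constants of Theorem 1.7 are left out because they depend on the source class.

```python
import math


def prefix_bits(n, J):
    """Bits needed to index one of at most n**(2*J) design point sources."""
    return math.ceil(2 * J * math.log2(n))


def prefix_overhead(n, J):
    """Per-letter cost of the prefix for blocklength n."""
    return prefix_bits(n, J) / n


for n in (8, 64, 1024):
    print(n, prefix_bits(n, 2), prefix_overhead(n, 2))
```

The prefix grows only logarithmically in n while the block grows linearly, so its per-letter cost vanishes as the blocklength increases.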

If some of the source parameters are fixed for all $\theta \in \tilde{\Lambda}$ so that $\tilde{\Lambda}$ has dimension M, where M < 2J, then $2J \log n$ is replaced by $M \log n$ in (100).

To illustrate this procedure, Fig. 5 contains a graph of the redundancy of a blocklength 8 code for the class of binary two-state composite sources with $\gamma_0(0) = \gamma_1(1) = .9$ and $\alpha = \beta$ in [0,1]. The code was constructed by combining codes designed for $\alpha$ = .05, .30, .70, and .95 respectively. The redundancies of the codes for $\alpha$ = .05 and .30 are also graphed over the class of sources. If these curves are reflected about $\alpha$ = .5 then they become the curves for $\alpha$ = .95 and .70. Note that the maximum redundancy of the combined code is much less than those of the other codes.


1.7  Coding  for  an  Infinite-State  Composite  Source 

The coding technique derived in Section 1.5 applies to composite sources with a finite number of subsources. We now construct a code for a certain type of composite source with an infinite state space, and show that the rate of this code approaches the entropy of the source.

The state space $\mathcal{S}$ is the class of all memoryless sources with alphabet $A = \{1, 2, \ldots, J\}$. We define $\mathcal{S}$ such that if $\mathbf{z} = (y_1, \ldots, y_{J-1})$ is a source in $\mathcal{S}$ then

$$\gamma_{\mathbf{z}}(k) \triangleq P\{X_i = k \mid Z_i = \mathbf{z}\} = y_k, \quad k = 1, \ldots, J, \tag{101}$$

where

$$y_J \triangleq 1 - \sum_{k=1}^{J-1} y_k. \tag{102}$$

At each integer time $i$ the switching process $Z_i$ changes with probability $\alpha$. If it does change then it takes on a new value according to a probability measure $P^*$ on $\mathcal{S}$ which does not depend on the previous state. So each time the source changes state the effect of the past states is eliminated. We first assume that $P^*$ has a density which we denote $z^*$. So if $Z_i = \mathbf{z} \in \mathcal{S}$ then $Z_{i+1} = \mathbf{z}$ with probability $1 - \alpha$, and

$$P\{Z_{i+1} \in B\} = P^*(B) \tag{103}$$

with probability $\alpha$, where $B$ is a subset of $\mathcal{S}$.

The estimate of the state $Z_i$ given the past outputs $(x_i, x_{i-1}, \ldots)$ is a probability measure $\hat{P}_i$ on $\mathcal{S}$ such that

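$$\hat{P}_i(B) = P\{Z_i \in B \mid x_i, x_{i-1}, \ldots\}. \tag{104}$$

If we assume that this measure also has a density $\hat{Z}_i$ we may derive $\hat{Z}_{i+1}$ from $\hat{Z}_i$ and $X_{i+1}$ using Bayes rule. Let $p_Z$ be the density of $Z_1$ given $\hat{Z}_0 = z_0$. Then we have

$$p_Z = \alpha z^* + (1 - \alpha) z_0. \tag{105}$$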


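Further, if $p_{X,Z}$ is the joint density of $Z_1$ and $X_1$ given $\hat{Z}_0 = z_0$ then

$$p_{X,Z}(\mathbf{z}, k) = P\{X_1 = k \mid Z_1 = \mathbf{z}\}\, p_Z(\mathbf{z}). \tag{106}$$

So if $X_1 = k$ and $\hat{Z}_0 = z_0$ then $\hat{Z}_1$ is given by

$$\hat{Z}_1(\mathbf{z}) = \frac{p_{X,Z}(\mathbf{z}, k)}{\int p_{X,Z}(\mathbf{z}', k)\, d\mathbf{z}'} = \frac{y_k[\alpha z^*(\mathbf{z}) + (1 - \alpha) z_0(\mathbf{z})]}{\alpha \int y_k z^*(\mathbf{z})\, d\mathbf{z} + (1 - \alpha) \int y_k z_0(\mathbf{z})\, d\mathbf{z}} \tag{107}$$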


where $\mathbf{z} = (y_1, \ldots, y_{J-1})$. Since $\hat{Z}_1$ is of the form

$$\hat{Z}_1(\mathbf{z}) = y_k[K z^*(\mathbf{z}) + K' z_0(\mathbf{z})], \tag{108}$$

where $K$ and $K'$ do not depend on $\mathbf{z}$, all subsequent densities $\hat{Z}_i$ derived from $\hat{Z}_1$ will be of the form

$$\hat{Z}_i(\mathbf{z}) = \sum_{\substack{m_1, \ldots, m_J \\ \sum_j m_j \le i}} \prod_{j=1}^{J} [y_j]^{m_j} \big[K(m_1, \ldots, m_J)\, z_0(\mathbf{z}) + K'(m_1, \ldots, m_J)\, z^*(\mathbf{z})\big] \tag{109}$$

where $K(\cdot)$ and $K'(\cdot)$ do not depend on $\mathbf{z}$. So although the estimate $\hat{Z}_i$ is infinite dimensional, given $z_0$ and $z^*$ only a finite number of constants $K(\cdot)$ and $K'(\cdot)$ are required to specify $\hat{Z}_i$ for any $i$. Further, knowledge of the moments of $z_0$ and $z^*$ is sufficient to compute these constants.

The probability that $X_1 = k$ given $\hat{Z}_0 = z_0$, denoted $p(k \mid z_0)$, is the denominator of (107), and the probability of a block of source outputs $\mathbf{x}$, denoted $p(\mathbf{x} \mid z_0)$, may be computed by generating $\hat{Z}_i$ recursively from (107).
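
The recursion (105)-(109) can be made concrete for a binary alphabet. The Python sketch below is illustrative only and makes two assumptions that are not in the text: J = 2, and the density of the restart measure is uniform on [0, 1], with the initial estimate equal to that density. Each term of (109) is then an unnormalized Beta density, so the estimate is stored as a dictionary mapping a pair of exponents to a mixture weight. The function name `block_probability` is hypothetical, and letters are indexed 0 and 1.

```python
import itertools


def block_probability(xs, alpha):
    """Probability of the binary block xs when the initial estimate is uniform."""
    # Each key (m0, m1) stands for the density proportional to y**m0 * (1 - y)**m1,
    # where y is the probability of letter 0; the value is its mixture weight.
    comps = {(0, 0): 1.0}
    prob = 1.0
    for k in xs:
        # Switching step, as in (105): restart from the uniform density w.p. alpha.
        mixed = {c: (1.0 - alpha) * w for c, w in comps.items()}
        mixed[(0, 0)] = mixed.get((0, 0), 0.0) + alpha
        # Bayes step, as in (107): the normalizer pk is the probability of letter k.
        updated = {}
        pk = 0.0
        for (m0, m1), w in mixed.items():
            pred = ((m0, m1)[k] + 1) / (m0 + m1 + 2)   # mean of the Beta component
            pk += w * pred
            c = (m0 + 1, m1) if k == 0 else (m0, m1 + 1)
            updated[c] = updated.get(c, 0.0) + w * pred
        comps = {c: w / pk for c, w in updated.items()}
        prob *= pk
    return prob


# The block probabilities of a fixed length should sum to one.
total = sum(block_probability(xs, 0.3) for xs in itertools.product((0, 1), repeat=4))
print(total)
```

Only the exponents and a finite list of weights are stored, which mirrors the observation that a finite number of constants specifies the estimate even though the state space is infinite.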

Given this estimation procedure we construct a code as follows. Compute the probabilities $p(\mathbf{x} \mid z^*)$ of output blocks $\mathbf{x} \in A^n$, where $z^*$ is the density of $P^*$ as previously defined. The code is then the Huffman code for these probabilities. So if the length function of the code is $\ell_n$ then this code minimizes the redundancy

$$r_n(\theta, z^*) = \sum_{\mathbf{x} \in A^n} p(\mathbf{x} \mid z^*)[\ell_n(\mathbf{x}) + \log p(\mathbf{x} \mid z^*)]. \tag{110}$$

Let $\mathcal{H}(\theta) \triangleq H(X_0 \mid X_{-1}, \ldots)$ be the entropy of source $\theta$. The probability of $\mathbf{x}$ given no previous source outputs is $p(\mathbf{x} \mid z^*)$ since $z^*$ is the stationary distribution of the switching process. If we define $R_n(\theta)$ to be the average rate of the code $\ell_n$ when applied to source $\theta$ then

$$R_n(\theta) = n^{-1} \sum_{\mathbf{x} \in A^n} p(\mathbf{x} \mid z^*)\, \ell_n(\mathbf{x}). \tag{111}$$

The following theorem gives an upper bound on the average redundancy of the code $\ell_n$.

Theorem 1.8. Let $r_n(\theta) = R_n(\theta) - \mathcal{H}(\theta)$ be the redundancy of the code $\ell_n$. Then

rQ(0)  <S  n”1(l  -  i  log  J)  . 


(112) 


Proof. See Appendix A.

We assume that both $P^*$ and the estimates $\hat{P}_i$ have densities. If they do not, the estimation procedure may be modified as follows. Let $\pi = \frac{1}{2}(P^* + \hat{P}_0)$, where $\hat{P}_0$ is the initial estimate. Then $\frac{dP^*}{d\pi}$ and $\frac{d\hat{P}_i}{d\pi}$ exist for all $i \ge 0$. If we replace $z^*$ and $\hat{Z}_i$ by $\frac{dP^*}{d\pi}$ and $\frac{d\hat{P}_i}{d\pi}$ in (105)-(109) and integrate with respect to $\pi$, then (107) gives a recursion for $\frac{d\hat{P}_i}{d\pi}$. The code is then defined as before and the redundancy bound holds.


CHAPTER  2 


UNIVERSAL  VL-FL  CODING  FOR  MARKOV  SOURCES 


2.1. Introduction and Review of Previous Results


An efficient universal noiseless source coding technique is presented in [11] for memoryless sources. It is extended to unifilar Markov sources in [12] and [13]. The codes constructed in these papers are fixed-length-to-variable-length (FL-VL) codes; that is, they encode fixed-length blocks of source outputs into variable-length binary codewords. We use the same basic technique to construct universal variable-length-to-fixed-length (VL-FL) codes for unifilar Markov sources. The performance of these VL-FL codes for binary memoryless sources is compared to that of the FL-VL codes constructed in [11]. We show that for medium blocklengths (~10) the VL-FL codes perform better and that for long blocklengths (~100) they perform about as well as the FL-VL codes.

Next a review of some terminology of universal noiseless coding [11] in a fixed-length-to-variable-length (FL-VL) framework may be helpful. Let $\Lambda$ be a class of stationary sources. Each $\theta \in \Lambda$ has a probability function $p_\theta$ which gives the probability of the various possible strings of outputs. A FL-VL code of blocklength n maps blocks of n source symbols into variable-length binary sequences. Let $\mathbf{x} = (x_1, \ldots, x_n)$ be a block of source outputs. A FL-VL code is specified for our purposes by the length function $\ell_n(\mathbf{x})$ which gives the length of the codeword for $\mathbf{x}$. The rate of a FL-VL code applied to a source $\theta$ is

$$R_n(\ell_n, \theta) = n^{-1} \sum_{\mathbf{x} \in A^n} \ell_n(\mathbf{x})\, p_\theta(\mathbf{x}) \tag{113}$$

where $A^n$ is the set of possible n-tuples from source $\theta$. Defining the n-th order per-letter entropy of $\theta$ as

$$H_n(\theta) = -n^{-1} \sum_{\mathbf{x} \in A^n} p_\theta(\mathbf{x}) \log p_\theta(\mathbf{x}), \tag{114}$$

the n-th order redundancy of the code is

$$r_n(\ell_n, \theta) = R_n(\ell_n, \theta) - H_n(\theta). \tag{115}$$

Let

$$\bar{r}_n(\ell_n) = \sup\{r_n(\ell_n, \theta) : \theta \in \Lambda\}. \tag{116}$$

A sequence of codes $\ell_1, \ell_2, \ldots$ is weakly universal if

$$R_n(\ell_n, \theta) \to H(\theta) \quad \forall\, \theta \in \Lambda \tag{117}$$

as $n \to \infty$, where $H(\theta) = \lim_{n \to \infty} H_n(\theta)$ is the entropy of the source $\theta$. It is strongly universal if the convergence of (117) is uniform, and minimax universal if $\bar{r}_n(\ell_n) \to 0$ as $n \to \infty$. Let $\mathcal{L}_n$ be the set of blocklength n FL-VL codes. We define the n-th order FL-VL minimax redundancy as

$$\mathcal{R}_F(n) = \inf\{\bar{r}_n(\ell_n) : \ell_n \in \mathcal{L}_n\}. \tag{118}$$

We now define similar quantities for VL-FL codes. A VL-FL code maps variable-length strings of source outputs into fixed-length binary codewords. The performance of a VL-FL code is determined by a set $\Gamma$ which consists of the variable-length strings of source outputs which are encoded. The blocklength of a VL-FL code is the length of the codewords and is denoted by n. So $n = \lceil \log |\Gamma| \rceil$ where $|\Gamma|$ is the cardinality of the set $\Gamma$ and $\lceil a \rceil$ represents the smallest integer not less than $a$. Since $\Gamma$ completely specifies the code we refer to $\Gamma$ as the code. Let $\ell(\mathbf{x})$ be the number of letters in the string $\mathbf{x}$. The rate of a VL-FL code $\Gamma$ applied to a source $\theta$ is

$$R_n(\Gamma, \theta) = n[\bar{\ell}_\theta(\Gamma)]^{-1} \tag{119}$$

where

$$\bar{\ell}_\theta(\Gamma) \triangleq \sum_{\mathbf{x} \in \Gamma} p_\theta(\mathbf{x})\, \ell(\mathbf{x}) \tag{120}$$

is the expected length of the input strings. We may define a lower bound on the rate of this code as

$$\mathcal{H}(\Gamma, \theta) = -\sum_{\mathbf{x} \in \Gamma} p_\theta(\mathbf{x}) \log p_\theta(\mathbf{x})\, [\bar{\ell}_\theta(\Gamma)]^{-1}. \tag{121}$$

So $\mathcal{H}(\Gamma, \theta)$ is the entropy of the set of strings $\mathbf{x} \in \Gamma$ divided by the expected length of these strings. The redundancy of the code $\Gamma$ is defined as

$$r_n(\Gamma, \theta) = R_n(\Gamma, \theta) - \mathcal{H}(\Gamma, \theta) \tag{122}$$

and the maximum redundancy is

$$\bar{r}_n(\Gamma) = \sup\{r_n(\Gamma, \theta) : \theta \in \Lambda\}. \tag{123}$$

If $\mathcal{T}_n$ is the set of all VL-FL codes of blocklength n then define

$$\mathcal{R}_V(n) = \inf\{\bar{r}_n(\Gamma) : \Gamma \in \mathcal{T}_n\}. \tag{124}$$
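
The definitions (119)-(122) are easy to evaluate for a small example. The Python sketch below is illustrative only. It assumes a binary memoryless source with P(0) = 0.9 and a hand-picked complete set of four encoded strings (so n = 2), with logarithms to base 2; none of these values come from the text, and the function names are hypothetical.

```python
import math


def string_prob(x, p):
    """Probability of the string x for a memoryless source with letter probabilities p."""
    prob = 1.0
    for letter in x:
        prob *= p[letter]
    return prob


def vlfl_stats(strings, p):
    """Return (rate, lower bound, redundancy) as in (119), (121), and (122)."""
    n = math.ceil(math.log2(len(strings)))                       # codeword length
    probs = [string_prob(x, p) for x in strings]
    mean_len = sum(q * len(x) for q, x in zip(probs, strings))   # expected length (120)
    entropy = -sum(q * math.log2(q) for q in probs)              # entropy of the string set
    rate = n / mean_len
    bound = entropy / mean_len
    return rate, bound, rate - bound


p = {0: 0.9, 1: 0.1}
strings = [(1,), (0, 1), (0, 0, 1), (0, 0, 0)]   # leaves of a complete parsing tree
rate, bound, red = vlfl_stats(strings, p)
source_entropy = -sum(q * math.log2(q) for q in p.values())
print(rate, bound, red, source_entropy)
```

For this memoryless example the lower bound equals the per-letter source entropy, so the redundancy is simply the gap between the rate and the entropy.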

For the definitions (113)-(124) it is assumed that each source $\theta \in \Lambda$ is stationary. A unifilar Markov source is stationary only if it is in its steady-state distribution. We do not wish to assume that the sources are in their steady-state distributions since we are interested in applying these codes to sources with slowly varying probabilities $p_\theta$. For this reason the codes which we construct are universal with respect to the initial state of the source.

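Let $\theta$ be a unifilar Markov source with alphabet $A = \{1, 2, \ldots, J\}$ and a set of states $\mathcal{S} = \{1, 2, \ldots, S\}$. The properties of the source $\theta$ are given by an initial state $s_0$ and a pair of $J \times S$ matrices $P_\theta = \{p_\theta(x \mid s)\}$ and $F_\theta = \{f_\theta(x, s)\}$, where $p_\theta(x \mid s)$ is the probability that letter $x$ is output when the source is in state $s$, and $f_\theta(x, s)$ is the state into which the source moves following this event. The probability of a string $\mathbf{x} = (x_1, \ldots, x_k)$ which starts with the first output letter is

$$p_\theta(\mathbf{x}) = \prod_{i=1}^{k} p_\theta(x_i \mid s_{i-1}) \tag{125}$$

where

$$s_i = f_\theta(x_i, s_{i-1}). \tag{126}$$

If $\mathbf{x} = (x_{m+1}, \ldots, x_{m+k})$ then

$$p_\theta(\mathbf{x}) = \sum_{j \in \mathcal{S}} P_\theta(j \mid m, s_0) \prod_{i=1}^{k} p_\theta(x_{m+i} \mid s_{m+i-1})$$

where $P_\theta(j \mid m, s_0)$ is the probability of being in state $j$ after m steps if the initial state is $s_0$, $s_m = j$, and $s_{m+i} = f_\theta(x_{m+i}, s_{m+i-1})$, $i = 1, \ldots, k-1$.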

We assume that $\Lambda$ is the class of all unifilar sources with a given alphabet $A$, state space $\mathcal{S}$, and transition matrix $F_\theta$. (So $F_\theta$ is the same for all $\theta \in \Lambda$.) A source $\theta \in \Lambda$ is then specified by an initial state $s_0$ and a transition probability matrix $P_\theta$. The sources in $\Lambda$ are not stationary, but the quantities defined in (113)-(116) are valid if we assume that $\mathbf{x} = (x_1, \ldots, x_n)$ is the first block of n source outputs so that $p_\theta(\mathbf{x})$ is given by (125) and (126).

For VL-FL codes there are other difficulties. First, $\mathcal{H}(\Gamma, \theta)$ depends on the code $\Gamma$, so the code with the smallest redundancy $r_n(\Gamma, \theta)$ does not necessarily have the lowest rate $R_n(\Gamma, \theta)$. For memoryless sources $\mathcal{H}(\Gamma, \theta)$ is the entropy of the source, so it is independent of $\Gamma$. This is not the case for unifilar Markov sources. In fact, even if a source is in its steady-state distribution before the first string of source outputs is encoded, it need not be afterwards. The VL-FL code induces a distribution on the states. However, we may show that the lower bound of (121) is independent of $\Gamma$ in the following sense.

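Let $\{\theta_i,\ i = 1, 2, \ldots, S\}$ be a set of sources in $\Lambda$ with transition probabilities $p_{\theta_i} = p_\theta$ such that $\theta_i$ has initial state $i$. Suppose that some set of S codes with encoding sets $\Gamma_i$ achieves $\mathcal{H}(\Gamma_i, \theta_i)$, $i = 1, \ldots, S$. Then from the Kraft inequality and the fact that for any $\varphi \in \Lambda$

$$-\sum p_{\theta_i}(\mathbf{x}) \log p_\varphi(\mathbf{x}) \ge -\sum p_{\theta_i}(\mathbf{x}) \log p_{\theta_i}(\mathbf{x})$$

with equality if and only if $p_{\theta_i}(\mathbf{x}) = p_\varphi(\mathbf{x})$, the length of the codeword for $\mathbf{x} \in \Gamma_i$ must be $-\log p_{\theta_i}(\mathbf{x})$. (Note that this set of codes is VL-VL.) Now if we wish to determine the total length of the codewords used to encode a block $\mathbf{z}$ of m consecutive outputs with this set of codes, the problem is that the end of the block $\mathbf{z}$ may be in the middle of an encoded string. However, due to the structure of the codes this problem may be resolved by dividing codewords. Suppose that one encoded string $\mathbf{x}$ has k letters within $\mathbf{z}$ and $\ell(\mathbf{x}) - k$ outside $\mathbf{z}$. The length of the codeword for $\mathbf{x}$ is $-\log p_{\theta_i}(\mathbf{x})$ and since

$$p_{\theta_i}(\mathbf{x}) = \prod_{j=1}^{\ell(\mathbf{x})} p_\theta(x_j \mid s_{j-1}) \tag{127}$$

$$= \prod_{j=1}^{k} p_\theta(x_j \mid s_{j-1}) \prod_{j=k+1}^{\ell(\mathbf{x})} p_\theta(x_j \mid s_{j-1}), \tag{128}$$

the part of the codeword due to letters within $\mathbf{z}$ is $-\log \prod_{j=1}^{k} p_\theta(x_j \mid s_{j-1})$, independently of the following symbols. So the (not necessarily integer) number of bits used to encode $\mathbf{z}$ with initial state $s_0$ is

$$-\sum_{i} \log p_\theta(\mathbf{x}^{(i)} \mid s^{(i)}) = -\log p_\theta(\mathbf{z}) \tag{129}$$

where $\mathbf{z}$ is encoded as $\mathbf{x}^{(1)}, \mathbf{x}^{(2)}, \ldots$ (the last of these strings is not necessarily an entire encoded string), $s^{(i)}$ is the state in which the string $\mathbf{x}^{(i)}$ begins, and $p_\theta(\mathbf{x} \mid s)$ is the probability of $\mathbf{x}$ for source $\theta_s$. The expected rate of this set of codes is

$$-m^{-1} \sum_{\mathbf{z} \in A^m} p_\theta(\mathbf{z} \mid s_0) \log p_\theta(\mathbf{z} \mid s_0) = H_m(\theta_{s_0}) \tag{130}$$

where $\theta_{s_0} = (p_\theta, s_0)$. So a set of codes which achieves $\mathcal{H}(\Gamma_i, \theta_i)$ for all initial states achieves the m-th order entropy given any initial state.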


If we have a VL-FL code $\Gamma^*$ such that

$$R_n(\Gamma^*, \theta_i) \le \mathcal{H}(\Gamma^*, \theta_i) + \epsilon \tag{131}$$

where $\epsilon$ is independent of $i$, then the expected rate of this code over m outputs is

$$R_m(\theta_{s_0}) \le H_m(\theta_{s_0}) + \epsilon \tag{132}$$

from (130), since $\epsilon$ is simply an extra per-letter redundancy. So given the bound of (131), which depends on the set $\Gamma^*$, we may derive a performance bound (132) which is independent of $\Gamma^*$.

If a sequence of codes $\Gamma_n$ is minimax universal then from (132) and the fact that $\Lambda$ contains sources with all possible $P_\theta$ and initial states $s_0$, the rate of these codes approaches the m-th order entropy if we average as above. So if a VL-FL code $\Gamma_n$ has

$$\bar{r}_n(\Gamma_n) = \epsilon \tag{133}$$

and a FL-VL code $\ell_{\tilde{n}}$ has

$$\bar{r}_{\tilde{n}}(\ell_{\tilde{n}}) = \epsilon \tag{134}$$

then the two codes have approximately the same rate when averaged over a block of source outputs. A FL-VL code and a VL-FL code with the same number of codewords (due to the lack of structure in the codes the number of codewords is a good measure of complexity) have blocklengths $n$ and $n \log J$ respectively. So if we wish to compare codes of the same complexity, then we should compare the performance of a blocklength $n$ FL-VL code to that of a blocklength $n \log J$ VL-FL code.


In [14] a delay parameter $d^*$ is defined by $d^* = n$ for FL-VL codes and by

$$d^* = \inf\{\bar{\ell}_\theta(\Gamma) : \theta \in \Lambda\} \tag{135}$$

for a VL-FL code $\Gamma$. The minimax redundancy, denoted $\bar{\mathcal{R}}_F(d)$ and $\bar{\mathcal{R}}_V(d)$ for FL-VL and VL-FL codes respectively, is defined as the minimum of $\bar{r}_n(\ell_n)$ or $\bar{r}_n(\Gamma)$ over all codes whose delay $d^*$ does not exceed $d$. This may seem somewhat unnatural, but the number of codewords is approximately the same for all codes with delay $d^*$, so this approach leads to the same comparison as that mentioned above. We show this as follows. First, since $d^* = n$ for FL-VL codes we have

$$\bar{\mathcal{R}}_F(n) = \mathcal{R}_F(n). \tag{136}$$

However, $\mathcal{R}_V(n)$ and $\bar{\mathcal{R}}_V(n(\log J)^{-1})$ are not quite the same. Any blocklength n VL-FL code $\Gamma_n$ satisfies

$$d^* \le n(\log J)^{-1} \tag{137}$$

if $\Lambda$ includes the source $\theta^*$ which has all letters equiprobable in all states. This is because the entropy of $\theta^*$ is $\log J$. So we have

$$\mathcal{R}_V(n) \ge \bar{\mathcal{R}}_V(n[\log J]^{-1}). \tag{138}$$

Further, if $\Gamma_n^*$ achieves the minimax redundancy $\mathcal{R}_V(n)$ then

$$\bar{\ell}_\theta(\Gamma_n^*) \ge n[\mathcal{H}(\Gamma_n^*, \theta) + \mathcal{R}_V(n)]^{-1}, \tag{139}$$

$$d^* \ge n[\log J + \mathcal{R}_V(n)]^{-1}.$$

This implies

$$\mathcal{R}_V(n) \le \bar{\mathcal{R}}_V(n[\log J + \mathcal{R}_V(n)]^{-1}). \tag{140}$$

Since $\mathcal{R}_V(n)$ is $O(n^{-1} \log n)$, we have

$$\mathcal{R}_V(n) \approx \bar{\mathcal{R}}_V(n[\log J]^{-1}). \tag{141}$$

So any bound on $\mathcal{R}_V$ may be used to derive a bound on $\bar{\mathcal{R}}_V$.


There are a number of papers with results on the minimax redundancy of FL-VL codes. In [15] and [16] asymptotic upper and lower bounds are derived for unifilar Markov sources which show

$$\mathcal{R}_F(n) = \tfrac{1}{2} n^{-1}(J-1)S \log n + O(n^{-1}). \tag{142}$$

These results are only asymptotic, however, as the $O(n^{-1})$ term is not evaluated. An upper bound

$$\mathcal{R}_F(n) < \tfrac{1}{2} n^{-1}(J-1)S \log n + K n^{-1} \tag{143}$$

is derived in [12] and the constant $K$ is given explicitly. For memoryless sources a lower bound is derived in [5] which shows that

$$\mathcal{R}_F(n) \ge \tfrac{1}{2} n^{-1}(J-1) \log n - K' n^{-1}. \tag{144}$$

Again the constant $K'$ is determined.

There are fewer results for VL-FL codes. Lawrence [17] derives a universal VL-FL coding technique for binary memoryless sources which has

$$\bar{r}_n(\Gamma_n) \le n^{-1} \log n + K'' n^{-1}. \tag{145}$$

(This bound, however, does not appear in the paper.) In [14] results of Khodak are mentioned which state that

$$\mathcal{R}_V(n) = O(n^{-1} \log n) \tag{146}$$

for memoryless sources. In the next section we show that

$$\mathcal{R}_V(n) \le (\log J)\, n^{-1}[\tfrac{1}{2} S(J-1) + 1] \log n + K n^{-1} \tag{147}$$

for unifilar Markov sources.

2.2.  Universal VL-FL Code Construction


First we introduce the optimal VL-FL coding procedure (Tunstall's algorithm [18]) for memoryless sources. Let $\theta$ be a discrete memoryless source with alphabet $A = \{1,2,\ldots,J\}$ and let $p_\theta(x) = P\{X = x\}$, $x \in A$, be the probability that the letter $x$ is output. A VL-FL code maps a variable number of source outputs into a fixed number of code symbols from an alphabet $C$. We will assume that $C = \{0,1\}$, i.e., that the code is binary. Tunstall's algorithm generates a rooted tree whose terminal nodes (leaves) correspond to codewords. There are $J$ branches leaving each non-terminal node, and these branches are labelled with the $J$ source symbols. The encoding procedure consists of starting at the root node and traversing, after each source output, the branch with the corresponding label. When a leaf is reached, the codeword assigned to that leaf is sent and the procedure is repeated. So each leaf corresponds to a unique string $\mathbf{x} = (x_1,\ldots,x_k)$ of source outputs and has probability $P_\theta(\mathbf{x}) = \prod_{i=1}^{k} p_\theta(x_i)$. The algorithm generates a larger optimal tree from a smaller one by adding $J$ branches to the tree at the leaf with the highest probability. So the highest probability leaf is divided into a set of $J$ leaves. It is easily seen that the ratio of the lowest probability leaf in the tree to the highest is not less than $\alpha \triangleq \min\{p_\theta(x) : x \in A\}$.

In Figure 6 this procedure is illustrated for a binary memoryless source with $p(1) = .7$ and $n = 2$ (4 leaves). The encoding tree is formed in three steps with the most probable leaf being extended at each step. Each of the final set of input strings $\Gamma$ is assigned a codeword of length 2.
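
The growth rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `tunstall`, the dictionary representation of the letter probabilities, and the use of a heap to find the most probable leaf are choices made here, not taken from the text. The example run uses the source of Figure 6 ($p(1) = .7$, 4 leaves).

```python
import heapq

def tunstall(probs, num_leaves):
    """Grow a Tunstall tree for a memoryless source (illustrative sketch).

    probs: dict mapping each source letter to its probability.
    num_leaves: upper limit on the number of leaves (codewords).
    Returns a dict mapping each leaf string (tuple of letters) to its probability.
    """
    J = len(probs)
    # Max-heap via negated probabilities; the root is first split into J leaves.
    heap = [(-p, (x,)) for x, p in probs.items()]
    heapq.heapify(heap)
    # Each extension replaces one leaf by J leaves, a net gain of J - 1.
    while len(heap) + J - 1 <= num_leaves:
        neg_p, string = heapq.heappop(heap)  # most probable leaf
        for x, p in probs.items():
            heapq.heappush(heap, (neg_p * p, string + (x,)))
    return {string: -neg_p for neg_p, string in heap}

# Source of Figure 6: binary, p(1) = .7, blocklength 2 (4 leaves).
leaves = tunstall({1: 0.7, 0: 0.3}, num_leaves=4)
for string, p in sorted(leaves.items(), key=lambda kv: -kv[1]):
    print(string, round(p, 3))
```

In this run the root split, the extension of the leaf 1, and the extension of the leaf 11 are the three steps mentioned above, and the ratio of the least to the most probable leaf stays above the smallest letter probability.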


Figure 6.  Construction of a Tunstall code with blocklength 2 for a binary memoryless source with $p(1) = .7$.
We now extend this algorithm to coding for unifilar Markov sources.

A VL-FL code for a unifilar Markov source $\theta$ is generated using an algorithm much like that for memoryless sources except that now each node of the tree has a state associated with it. The probabilities associated with the branches are given by $p_\theta(x|s)$ where $x$ is the output letter which labels the branch and $s$ is the state of the node which the branch is leaving. The algorithm again consists of extending the most probable node, where this probability is now given by the product of the transition probabilities of the branches traversed in reaching the node. It is not clear that this algorithm is optimal since in general to actually encode some block of source outputs $S$ encoding trees are necessary, each designed for the same source $\theta$ but for different initial states. The structure of each of these $S$ trees determines the probability of being in a particular state after encoding a string of source outputs, hence the probability that a particular tree is used is affected by the structure of all $S$ trees. It is not necessarily true that generating these trees independently (as is done here) is the optimal encoding algorithm. However, the algorithm does yield code trees which have asymptotically good performance as will be seen later. Further, in each tree the ratio of the minimum probability leaf to the maximum is not less than [19]

$$\alpha = \min\{p_\theta(x|s) : x \in A,\ s = 1,\ldots,S\}. \qquad (148)$$

We use the Tunstall algorithm for individual unifilar Markov sources to construct a universal code for a class $\Lambda$ of sources as follows. Let $\Phi_m = \{\varphi_i : i = 1,\ldots,\gamma_m\}$ be a finite subset of $\Lambda$ such that if $\varphi \in \Phi_m$, then there is a source $\varphi_j \in \Phi_m$ with initial state $j$ which has the same transition probability matrices as $\varphi$, for $j = 1,2,\ldots,S$. Let $\Gamma_m^{(i)}$ be the encoding set of a blocklength $m$ code designed for the $i$-th source $\varphi_i \in \Phi_m$. The codes $\Gamma_m^{(i)}$ are constructed using Tunstall's algorithm as above. The universal code $\Gamma_n^*$ is defined as follows ($n$ is defined in terms of $m$ in (150) below). A string $\mathbf{x}$ is an element of $\Gamma_n^*$ if $\mathbf{x} \in \Gamma_m^{(i)}$ for some $i$ and $(\mathbf{x} * \mathbf{y}) \notin \Gamma_m^{(j)}$ for any $j = 1,2,\ldots,\gamma_m$, where $\mathbf{y}$ is a non-empty string of source letters ($*$ represents concatenation). So the tree for $\Gamma_n^*$ contains all nodes from the trees for $\Gamma_m^{(i)}$, $i = 1,2,\ldots,\gamma_m$. Now

$$|\Gamma_n^*| \le \gamma_m 2^m \qquad (149)$$

so the strings in $\Gamma_n^*$ may be encoded into codewords of length

$$n \le m + \lceil \log \gamma_m \rceil. \qquad (150)$$
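
The combination rule (keep a string if it is a leaf of some individual code and no proper extension of it is a leaf of any code) can be sketched as follows for binary memoryless design sources. The helper names `tunstall` and `combine`, the string representation of leaves, and the two design points 0.7 and 0.3 are assumptions made for this illustration only.

```python
import heapq

def tunstall(p_one, num_leaves):
    """Binary Tunstall code: set of leaf strings over {'0','1'} for P(1) = p_one."""
    probs = {"1": p_one, "0": 1.0 - p_one}
    heap = [(-p, x) for x, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) + 1 <= num_leaves:
        neg_p, s = heapq.heappop(heap)  # most probable leaf
        for x, p in probs.items():
            heapq.heappush(heap, (neg_p * p, s + x))
    return {s for _, s in heap}

def combine(codes):
    """Leaves of the union tree: strings that are a leaf of some code
    and not a proper prefix of a leaf of any code."""
    all_leaves = set().union(*codes)
    proper_prefixes = {s[:i] for s in all_leaves for i in range(1, len(s))}
    return all_leaves - proper_prefixes

design_points = [0.7, 0.3]  # hypothetical design sources
codes = [tunstall(p, 4) for p in design_points]  # blocklength m = 2
universal = combine(codes)
print(sorted(universal))
```

Here the combined set has 6 strings, which is within the limit of 2 codes times 4 leaves, so codewords of length 3 suffice.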

The rate of this code when applied to a source $\theta$ is

$$R_n(\Gamma_n^*,\theta) = n[\bar{\ell}_\theta(\Gamma_n^*)]^{-1} \qquad (151)$$

$$\le [m + \lceil \log \gamma_m \rceil][\bar{\ell}_\theta(\Gamma_m^{(k)})]^{-1} \qquad (152)$$

for any $k = 1,2,\ldots,\gamma_m$. This follows because by its construction the expected length of the sequences in $\Gamma_n^*$ must be at least that of any of the sets $\Gamma_m^{(k)}$. There are two sources of redundancy in (152) which we must bound in order to bound the redundancy of the code $\Gamma_n^*$. The first is the $\lceil \log \gamma_m \rceil$ term which is due to the fact that $|\Gamma_n^*|$ may be as large as $\gamma_m 2^m$. The second factor is the difference between

$$\max\{\bar{\ell}_\theta(\Gamma_m^{(i)}) : i = 1,\ldots,\gamma_m\} \qquad (153)$$

and $\bar{\ell}_\theta(\Gamma_m)$, where $\Gamma_m$ is the Tunstall code designed for source $\theta$. So the second factor is derived from the mismatch between the actual source $\theta$ and the source for which the code was designed (that source being some $\varphi_i \in \Phi_m$). As $\gamma_m$ increases the effect of the first factor increases and that of the second decreases. Blumer [13] shows that for

$$\log \gamma_m = \tfrac{1}{2} S(J-1) \log m + K \qquad (154)$$

where $K$ does not depend on $m$, a set $\Phi_m$ may be constructed such that

$$\max\{\min\{\mathcal{H}_r(\theta;\varphi_i) : i = 1,2,\ldots,\gamma_m\} : \theta \in \Lambda\} \le m^{-1}. \qquad (155)$$

(Here $\mathcal{H}_r(\theta;\varphi)$ is the entropy of source $\theta$ relative to source $\varphi$.) We use this result to bound the effect of the mismatch. The details of the derivation are in Appendix B. The final result is

$$r_n(\Gamma_n^*) \le n^{-1} \log J[\log n + \tfrac{1}{2} S(J-1) \log n] + K_1 n^{-1} \qquad (156)$$

for $n > K_0 (\log n)$ where $K_0$ and $K_1$ are constants independent of $n$ and $\theta$ given in Appendix B, equations (B.34) and (B.35). So the code is minimax universal.

As previously discussed, we wish to compare the performance of a blocklength $n$ VL-FL code to that of a blocklength $n[\log J]^{-1}$ FL-VL code (these codes have the same number of codewords). For FL-VL codes (143) gives

$$R_F(n[\log J]^{-1}) \le n^{-1} \log J[\tfrac{1}{2} S(J-1) \log n] + Kn^{-1} \qquad (157)$$

and (156) implies

$$R_V(n) \le n^{-1} \log J[\log n + \tfrac{1}{2} S(J-1) \log n] + K_1 n^{-1} \qquad (158)$$

so we see that the leading term in the redundancy bounds is the same except for a $\log n$ term which appears in the VL-FL bound. This additional term is present because there is no known bound on the redundancy of a Tunstall code which remains finite as the minimum letter probability of the source approaches zero. If the sources have all letter probabilities greater than some $c > 0$, then the $\log n$ term is replaced by $\log c^{-1}$.

Further, if $\Lambda$ is the class of binary memoryless sources (so $S = 1$ and $A = \{0,1\}$), then the $\log n$ term may be eliminated. The final result for this case is

$$R_V(n) \le \tfrac{1}{2} n^{-1} \log n + K_2 n^{-1}. \qquad (159)$$

The derivation of this result appears in Appendix B.

2.3.  Performance  Evaluation  for  Binary  Memoryless  Sources 

In this section we construct VL-FL codes for the class of binary memoryless sources using the method presented in Section 2.2, and compare their performance to the performance of the FL-VL codes constructed in [11]. One modification to the basic code construction is given, and the performance of codes obtained from this modification is evaluated. Here $J = 2$ so $n \log J = n$ and we must compare VL-FL codes to FL-VL codes of the same blocklengths.

One difficulty which arises in designing a VL-FL code of blocklength $n$ is that we do not know a priori the cardinality of $\Gamma_n^*$ for a given $m$. We only have the upper bound of (149). So to actually construct a blocklength $n$ VL-FL code we use the following iterative procedure. We choose an initial number $N$ of codewords for the Tunstall codes $\Gamma_m^{(i)}$ which are designed for sources in $\Phi_m$ (here $m \triangleq \log N$ is not necessarily an integer). We then combine these codes into a single code $\Gamma^*(N)$. We iterate this procedure to find

$$N \triangleq \max\{N' : |\Gamma^*(N')| \le 2^n\}. \qquad (160)$$

So $N$ is the maximum number of codewords in the Tunstall codes $\Gamma_m^{(i)}$ such that the combined code has blocklength $n$. Then we set

$$\Gamma_n^* = \Gamma^*(N). \qquad (161)$$
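
A minimal sketch of this iterative search, for binary memoryless design sources, might look as follows. It assumes a fixed, hypothetical set of design points (the text lets the design set depend on $m$), and it searches upward one codeword at a time, relying on the fact that with fixed design points the combined code can only grow as the individual trees grow. All function names are illustrative.

```python
import heapq

def tunstall(p_one, num_leaves):
    """Binary Tunstall code: set of leaf strings for P(1) = p_one."""
    probs = {"1": p_one, "0": 1.0 - p_one}
    heap = [(-p, x) for x, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) + 1 <= num_leaves:
        neg_p, s = heapq.heappop(heap)
        for x, p in probs.items():
            heapq.heappush(heap, (neg_p * p, s + x))
    return {s for _, s in heap}

def combine(codes):
    """Leaves of the union of the individual code trees."""
    all_leaves = set().union(*codes)
    proper_prefixes = {s[:i] for s in all_leaves for i in range(1, len(s))}
    return all_leaves - proper_prefixes

def combined_code(design_points, N):
    """Combined code built from Tunstall codes with N codewords each."""
    return combine([tunstall(p, N) for p in design_points])

def largest_subcode_size(design_points, n):
    """Largest N such that the combined code still has at most 2**n strings."""
    N = 2
    while len(combined_code(design_points, N + 1)) <= 2 ** n:
        N += 1
    return N

design_points = [0.2, 0.5, 0.8]  # hypothetical design point set
n = 5
N = largest_subcode_size(design_points, n)
code = combined_code(design_points, N)
print(N, len(code))
```

The search terminates because the combined code always has at least as many strings as each individual code.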

If we let the parameter $\theta$ for a binary memoryless source be the probability of a one, then the class of binary memoryless sources is the interval $[0,1]$. Because $\Lambda$ is one-dimensional we may easily determine the optimum design point set $\Phi_m$ for any $\gamma_m$. These sets are given for some values of $\gamma_m$ in Table 1 of [11] and may be determined for other $\gamma_m$ using the technique described there. Codes of blocklengths 5, 8, and 10 were constructed using these sets $\Phi_m$. A graph of the redundancy of these codes is given in Figure 7. The curves are symmetric about $\theta = .5$. In Table 1 the maximum redundancies are compared to those of the FL-VL codes of [11]. We see that VL-FL codes have significantly better performance for blocklengths 8 and 10, and only slightly worse for blocklength 5. In Figure 8 we have graphed the redundancy of blocklength 8 and 10 VL-FL codes together with FL-VL codes of the same blocklengths. The VL-FL codes have lower redundancy for almost all values of $\theta$. The largest difference occurs at $\theta = 0$ or 1. The reason for this is that in any universal FL-VL code the codewords for the all zeros and all ones output blocks must have length at least two, hence the redundancy at $\theta = 0$ or 1 must be at least $2n^{-1}$.


The lack of structure in these codes typically requires a table lookup scheme for decoding, so their complexity increases as $2^n$, the total number of codewords. This limits $n$, and thus the achievable redundancy is also limited. To alleviate the problem of complexity we may adopt the following modified procedure. We design Tunstall codes of blocklength $m$ for the sources $\varphi_j \in \Phi_m$. Instead of combining these codes we leave them as separate "subcodes". Then we encode the source outputs $(x_0,x_1,\ldots)$ as follows. Each subcode encodes $k$ strings from the source output. We use the subcode which has the lowest rate for this set of $k$ strings, i.e., the one which encodes the largest number of source outputs. The codeword for this set of $k$ strings is the concatenation of a prefix of length $\lceil \log \gamma_m \rceil$ which identifies the subcode we are using with the $k$ codewords for the encoded strings from that subcode. The total number of codewords for this procedure is $\gamma_m 2^m$. The resultant blocklength is approximately $km$ and would require about $\gamma_m 2^{km}$ codewords in the original coding procedure. So the complexity of this blocklength $km$ code is approximately that of the blocklength $m$ code previously considered. The reason that this new code performs better than a blocklength $m$ code is that the redundancy due to combining the codes is of order $m^{-1} \log m$, whereas the other terms in the redundancy are of order $m^{-1}$. With the new procedure these terms are $(km)^{-1} \log m$ and $m^{-1}$ respectively so that the dominant term is reduced with respect to the other terms.
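
The subcode selection step can be sketched as below. The two leaf sets are the blocklength 2 Tunstall codes for hypothetical design points 0.7 and 0.3, written out by hand; the function names, the string representation, and the sample source sequence are assumptions made for illustration.

```python
import math

def parse(leaves, symbols, start, k):
    """Parse k consecutive leaf strings from symbols[start:] with a prefix-free leaf set."""
    max_len = max(len(s) for s in leaves)
    out, pos = [], start
    for _ in range(k):
        for length in range(1, max_len + 1):
            candidate = symbols[pos:pos + length]
            if candidate in leaves:
                out.append(candidate)
                pos += length
                break
        else:
            raise ValueError("source sequence too short to parse k strings")
    return out

def choose_subcode(subcodes, symbols, start, k):
    """Pick the subcode whose k parsed strings cover the most source symbols."""
    parsed = [parse(leaves, symbols, start, k) for leaves in subcodes]
    consumed = [sum(len(s) for s in strings) for strings in parsed]
    best = max(range(len(subcodes)), key=lambda i: consumed[i])
    return best, parsed[best], consumed[best]

subcodes = [
    {"111", "110", "10", "0"},   # designed for P(1) = 0.7
    {"000", "001", "01", "1"},   # designed for P(1) = 0.3
]
m, k = 2, 2
best, strings, consumed = choose_subcode(subcodes, "1111110110", 0, k)
prefix_bits = math.ceil(math.log2(len(subcodes)))
total_bits = prefix_bits + k * m  # prefix plus k codewords of length m
print(best, strings, consumed, total_bits)
```

In this run the first subcode covers 6 source symbols with its 2 strings while the second covers only 2, so the first is chosen and 5 code bits are sent for 6 source symbols.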

A similar technique is used in [11] for longer blocklengths. A special code is used there for sources with $\theta$ near 0 or 1, but the complexity remains about the same. In Table 2 we give the maximum redundancies of VL-FL codes of blocklengths 50, 80, and 100 which are constructed by encoding 10 strings with VL-FL subcodes at blocklengths 5, 8, and 10. Results from [11] for the same blocklengths are included for comparison. The FL-VL codes perform a little better, but there is no great difference.
CHAPTER  3

UNIVERSAL  CODING  FOR  REAL-VALUED  SOURCES 

3.1  Introduction 

Here we consider source coding for discrete-time real-valued sources.

The source output for the $i$-th time interval is a real random variable $X_i$.

In contrast with the previous chapter, the entropy of these sources is generally infinite, so noiseless source coding is not possible. The problem here is one in rate-distortion theory, so the goal is to find a code with low distortion for a given rate. We assume that we have a distortion measure $d_n(\mathbf{x},\hat{\mathbf{x}})$ for each positive integer $n$, where $\mathbf{x}$ and $\hat{\mathbf{x}}$ are elements of $\mathbb{R}^n$, and that there is a maximum distortion $\bar{D} < \infty$ such that

$$n^{-1} d_n(\mathbf{x},\hat{\mathbf{x}}) \le \bar{D}, \quad \mathbf{x},\hat{\mathbf{x}} \in \mathbb{R}^n \qquad (162)$$

for all $n$. There are a number of papers on coding techniques for known sources of this type, e.g., [25], [28], and [29]. For some specific classes of sources we show how a code for an entire class $\Lambda$ may be constructed using a coding technique for single sources in $\Lambda$. We show that asymptotically this code performs as well on any source $\theta \in \Lambda$ as a code designed specifically for that source.

The codes which we consider are fixed rate; that is, all codewords have the same length. The codes consist of vector quantization followed by a mapping of the quantizer outputs into fixed length binary sequences. A blocklength $n$ $M$-point vector quantizer is a mapping $f_n : \mathbb{R}^n \to A$ where $A = \{\mathbf{a}_i : i = 1,\ldots,M\}$ is a finite set with elements in $\mathbb{R}^n$. The elements of $A$ are called output levels. The distortion which results when the outputs of a given source $\theta$ are quantized is defined as

$$D(f_n;\theta) = n^{-1} E_\theta[d_n(\mathbf{X},f_n(\mathbf{X}))] \qquad (163)$$

where $d_n(\cdot,\cdot)$ is the distortion measure and $\mathbf{X} = (X_1,\ldots,X_n)$. The rate of quantizer $f_n$ for source $\theta$ is defined as

$$R(f_n;\theta) = n^{-1} \lceil \log M \rceil. \qquad (164)$$

For our purposes a code is determined by its associated quantizer $f_n$, so we refer to the code as $f_n$. Then the rate and distortion of the code $f_n$ when applied to source $\theta$ are defined by (164) and (163) respectively.

We assume that we have a coding technique for sources in a class $\Lambda$. So for any source $\theta \in \Lambda$ we may construct a blocklength $n$ code $f_n^\theta$ with $M$ output levels. For a given blocklength $n$ and rate $R$ let $\delta_{n,R}(\theta)$ be the distortion achieved by $f_n^\theta$. We assume that

$$\delta_{n,R}(\theta) = D(f_n^\theta;\theta) \le D(f_n^\varphi;\theta) \qquad (165)$$

for $\theta,\varphi \in \Lambda$. So when applied to source $\theta$, $f_n^\theta$ performs at least as well as a code designed for some other source in $\Lambda$. The coding technique here does not necessarily yield optimal codes: that is, $\delta_{n,R}(\theta)$ need not approach the distortion-rate function $D(R)$ for source $\theta$ as $n \to \infty$. For example, these codes may be derived from locally optimal quantizers (designed using the algorithm of [25]) or from optimal one-dimensional quantizers [29].

For some specific classes $\Lambda$ we show that given a coding technique we may construct a sequence of codes of increasing blocklength $f_1^*,f_2^*,\ldots$ such that

$$D(f_n^*;\theta) - \delta_{n,R}(\theta) \to 0$$

and

$$R(f_n^*;\theta) - R \to 0 \qquad (166)$$

uniformly on $\Lambda$ as $n \to \infty$. We call such a sequence of codes minimax universal with respect to the coding technique which yields $\delta_{n,R}(\theta)$. It is important to note that $\delta_{n,R}(\theta)$ is the distortion achieved by many different codes, each designed for a particular $\theta \in \Lambda$. In contrast to this, $D(f_n^*;\theta)$ is the distortion of a single code over the class $\Lambda$.

First we consider classes of memoryless sources. A general result is derived for classes which are twice-differentiable with respect to their parameters $\theta$. This result is in terms of an integral which is evaluated for some specific classes. For all of these classes the result is that

$$D(f_n^*;\theta) - \delta_{n,R}(\theta) \le K_1 n^{-1}$$

and

$$R(f_n^*;\theta) - R \le kn^{-1} \log n + K_2 n^{-1} \qquad (167)$$

where $k$ is the dimension of $\Lambda$ and $K_1$ and $K_2$ are constants. We then show that a result of the same form holds for $k$-th order Gaussian autoregressive sources. These codes give upper bounds on the additional rate and distortion incurred when coding for a class $\Lambda$ rather than a specific source $\theta \in \Lambda$.

An outline of the code construction and bounding of performance is as follows. For each integer $n$ we have a finite subset $\Phi_n$ of sources in $\Lambda$. Codes for each source in $\Phi_n$ are constructed. These codes are then combined into a single code by adding a prefix to each codeword which identifies the source in $\Phi_n$ for which the code is designed. The rate of the resultant code is greater than the rate of the individual codes because of this prefix. The code has low distortion for the sources in $\Phi_n$ but may not for sources which are not in $\Phi_n$. As the number of sources in $\Phi_n$ increases, the additional rate increases and the distortion decreases. The first result bounds the mismatch distortion, i.e., the distortion which results when a code designed for one source is applied to another, in terms of the entropy of one source relative to the other. Next we show that if the relative entropy may be bounded then we may pick $\Phi_n$ to give a minimax universal code. The relative entropy is then bounded for some classes of memoryless sources and finally for Gaussian autoregressive sources.

3.2  Code Construction

We design codes $f_n^{(i)}$ of blocklength $n$ and rate $R$ for sources $\varphi_i \in \Phi_n$, where $\Phi_n = \{\varphi_i : i = 1,\ldots,\gamma_n\}$ is a subset of $\Lambda$. These codes take $n$-tuples of source outputs into codewords of length $\lceil \log M \rceil$, where $M$ is the number of levels in the associated vector quantizer. These $\gamma_n$ codes are then combined by adding a $\lceil \log \gamma_n \rceil$-bit prefix to each codeword. We denote this combined code $f_n^*$. We know

$$R(f_n^{(i)};\theta) = R \qquad (168)$$

where $R = n^{-1} \lceil \log M \rceil$. The encoding procedure for $f_n^*$ is as follows. Set

$$f_n^*(\mathbf{x}) = f_n^{(i)}(\mathbf{x}) \qquad (169)$$

for $i = \arg\min_j d_n(\mathbf{x},f_n^{(j)}(\mathbf{x}))$. Then the codeword for $f_n^*(\mathbf{x})$ is the codeword for $f_n^{(i)}(\mathbf{x})$ with a prefix attached. So we have

$$D(f_n^*;\theta) \le \min_i D(f_n^{(i)};\theta); \qquad (170)$$

that is, the distortion of $f_n^*$ for a source $\theta$ is no greater than the distortion for any one of the codes $f_n^{(i)}$ from which it was constructed. The rate of $f_n^*$ is

$$R(f_n^*;\theta) = n^{-1}(\lceil \log M \rceil + \lceil \log \gamma_n \rceil). \qquad (171)$$
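
The encoding rule for the combined code can be sketched as follows. Squared Euclidean error is assumed as the distortion measure purely for illustration, and the two small codebooks, the input vector, and all function names are hypothetical.

```python
import math

def nearest(codebook, x):
    """Index and distortion of the output level closest to x (squared error assumed)."""
    dists = [sum((a - b) ** 2 for a, b in zip(level, x)) for level in codebook]
    j = min(range(len(codebook)), key=lambda idx: dists[idx])
    return j, dists[j]

def encode_combined(codebooks, x):
    """Return (i, j, d): prefix index i of the chosen code, level index j, distortion d."""
    results = [nearest(cb, x) for cb in codebooks]
    i = min(range(len(codebooks)), key=lambda idx: results[idx][1])
    return i, results[i][0], results[i][1]

def combined_rate(n, M, num_codes):
    """Rate of the combined code: codeword bits plus prefix bits, per source letter."""
    return (math.ceil(math.log2(M)) + math.ceil(math.log2(num_codes))) / n

# Two hypothetical 2-point quantizers of blocklength n = 2.
codebooks = [
    [(-1.0, -1.0), (1.0, 1.0)],
    [(-3.0, -3.0), (3.0, 3.0)],
]
i, j, d = encode_combined(codebooks, (2.6, 2.9))
rate = combined_rate(n=2, M=2, num_codes=len(codebooks))
print(i, j, round(d, 2), rate)
```

Because the encoder picks the best of the individual quantizers for every input vector, its distortion on any vector is never larger than that of any single quantizer, which is the pointwise form of the distortion inequality above; the price is the extra prefix bits in the rate.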

Now if $j = \arg\min_i D(f_n^{(i)};\theta)$ we have

$$D(f_n^*;\theta) \le \delta_{n,R}(\theta) + D(f_n^{(j)};\theta) - \delta_{n,R}(\theta). \qquad (172)$$

Let $f_n$ be the code designed for $\theta$. Now

$$D(f_n^{(j)};\varphi_j) = \delta_{n,R}(\varphi_j) \le D(f_n;\varphi_j) \qquad (173)$$

so we have

$$D(f_n^*;\theta) - \delta_{n,R}(\theta) \le [D(f_n^{(j)};\theta) - D(f_n^{(j)};\varphi_j)] + [D(f_n;\varphi_j) - D(f_n;\theta)]. \qquad (174)$$

The set $\Phi_n$ is designed such that the right hand side of (174) may be bounded uniformly for $\theta \in \Lambda$. Both of these terms are distortion mismatch terms, that is, they represent the distortion incurred when a quantizer designed for one source is used for another. The following theorem bounds the distortion mismatch in terms of the relative entropy.

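Theorem 3.1. If $f_n$ is a code and

$$\mathcal{H}_n(\theta;\varphi) + \mathcal{H}_n(\varphi;\theta) \le \delta \qquad (175)$$

then

$$|D(f_n;\theta) - D(f_n;\varphi)| \le \delta^{1/2} \bar{D} (2 \log e)^{-1/2} \qquad (176)$$

where $\mathcal{H}_n(\theta;\varphi)$ is the $n$-th order entropy of $\theta$ relative to $\varphi$ [30],

$$\mathcal{H}_n(\theta;\varphi) \triangleq \sup \sum_i \mu_\theta(B_i) \log \frac{\mu_\theta(B_i)}{\mu_\varphi(B_i)} \qquad (177)$$

where the supremum is over all finite partitions $\{B_i\}$ of $\mathbb{R}^n$ and $\mu_\theta(B)$ is the probability that the source output $\mathbf{x} \in \mathbb{R}^n$ is in $B$ for source $\theta$.

Proof. See Appendix C.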

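Now suppose that $\Lambda$ is a compact subset of $\mathbb{R}^k$. The following corollary bounds the rate and distortion of a universal code using Theorem 3.1. For $\theta = (\theta_1,\ldots,\theta_k)$ and $\varphi = (\varphi_1,\ldots,\varphi_k)$ we define the norm

$$\|\theta - \varphi\| = \max\{|\theta_i - \varphi_i| : i = 1,\ldots,k\}. \qquad (178)$$

Corollary 3.1. If $\|\theta - \varphi\| \le n^{-1}$ implies $\mathcal{H}_n(\theta;\varphi) \le Kn^{-2}$ for $\theta,\varphi \in \Lambda$ then a code $f_n^*$ may be constructed such that

$$|D(f_n^*;\theta) - \delta_{n,R}(\theta)| \le n^{-1} \bar{D} K^{1/2} (\log e)^{-1/2} \qquad (179)$$

and

$$R(f_n^*;\theta) - R \le n^{-1}\Big[k \log n + 1 + \sum_{i=1}^{k} \log(\ell_i + n^{-1})\Big] \qquad (180)$$

for all $\theta \in \Lambda$ where

$$\ell_i \triangleq \max_{\theta,\varphi \in \Lambda} |\theta_i - \varphi_i|, \quad i = 1,\ldots,k \qquad (181)$$

is the maximum difference in the $i$-th components of $\theta$ and $\varphi$ for any $\theta,\varphi \in \Lambda$.

Proof. We cover $\Lambda$ with cubes of size $n^{-1}$ and then let $\Phi_n$ consist of one source from each of these cubes. There are at most

$$\prod_{i=1}^{k} \lceil \ell_i n \rceil \qquad (182)$$

such cubes, which gives (180), and clearly for any $\theta \in \Lambda$ there exists a source $\varphi \in \Phi_n$ such that $\|\theta - \varphi\| \le n^{-1}$, so (179) follows from Theorem 3.1.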
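
The counting step in this proof can be checked numerically. The sketch below assumes a hypothetical two-parameter class whose parameters range over the box $[0,1] \times [1,3]$, counts the cubes of side $n^{-1}$ needed to cover it, and compares the resulting prefix overhead with the right hand side of (180), taking logarithms to base 2. The function names and the box are illustrative only.

```python
import math

def cube_count(side_lengths, n):
    """Number of cubes of side 1/n needed to cover a box with the given side lengths."""
    count = 1
    for length in side_lengths:
        count *= max(1, math.ceil(n * length))
    return count

def rate_overhead_bound(side_lengths, n):
    """Bound on the additional rate: (k log n + 1 + sum of log(l_i + 1/n)) / n."""
    k = len(side_lengths)
    total = k * math.log2(n) + 1 + sum(math.log2(l + 1.0 / n) for l in side_lengths)
    return total / n

side_lengths = [1.0, 2.0]  # hypothetical parameter ranges: [0, 1] and [1, 3]
n = 10
count = cube_count(side_lengths, n)
prefix_bits = math.ceil(math.log2(count))
overhead = prefix_bits / n
bound = rate_overhead_bound(side_lengths, n)
print(count, prefix_bits, overhead, round(bound, 4))
```

For this box the prefix costs 8 bits per block of 10 source letters, which is within the bound computed from the formula.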
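Let $\Lambda$ be a class of memoryless sources. Since $\Lambda$ is a subset of $\mathbb{R}^k$, a source $\theta \in \Lambda$ is specified by $k$ parameters $\{\theta_i, i = 1,\ldots,k\}$, $\theta_i \in \mathbb{R}$, and we write $\theta = (\theta_1,\ldots,\theta_k)$. If $\theta$ has a density $p_\theta$, then we assume that

$$p_\theta^{i} \triangleq \frac{\partial}{\partial \theta_i} p_\theta \quad \text{and} \quad p_\theta^{ij} \triangleq \frac{\partial^2}{\partial \theta_i\, \partial \theta_j} p_\theta \qquad (183)$$

exist for all $i,j = 1,\ldots,k$. For memoryless sources

$$p_\theta(\mathbf{x}) = \prod_{i=1}^{n} p_\theta(x_i)$$

so

$$\mathcal{H}_n(\theta;\varphi) = \mathcal{H}_1(\theta;\varphi). \qquad (184)$$

For such sources we have the following theorem.

Theorem 3.2. If $\|\theta - \varphi\| \le \epsilon$ then

$$\mathcal{H}_1(\theta;\varphi) \le k^2 K^* \epsilon^2 \qquad (185)$$

where

$$K^* \triangleq (\log e) \sup\left\{\left|\int \frac{p_\varphi^{i}(x)\, p_\varphi^{j}(x)}{p_\varphi(x)}\, dx\right| : \varphi \in \Lambda,\ i,j = 1,\ldots,k\right\} \qquad (186)$$

for $\Lambda \subset \mathbb{R}^k$.

Proof. This follows directly from Taylor's formula. We know [30]

$$\mathcal{H}_1(\theta;\varphi) = \int p_\theta(x) \log \frac{p_\theta(x)}{p_\varphi(x)}\, dx$$

so

$$\mathcal{H}_1(\theta;\varphi) \le |\mathcal{H}_1(\varphi;\varphi)| + \sum_{i=1}^{k} |\theta_i - \varphi_i| \left|\frac{\partial}{\partial \omega_i} \mathcal{H}_1(\omega;\varphi)\right|_{\omega=\varphi} + \sum_{i=1}^{k}\sum_{j=1}^{k} |\theta_i - \varphi_i|\,|\theta_j - \varphi_j| \sup\left\{\left|\frac{\partial^2}{\partial \omega_i\, \partial \omega_j} \mathcal{H}_1(\omega;\omega')\right|_{\omega'=\omega} : \omega \in \Lambda\right\}. \qquad (187)$$

Now $\mathcal{H}_1(\varphi;\varphi) = 0$ and

$$\frac{\partial}{\partial \omega_i} \mathcal{H}_1(\omega;\varphi)\Big|_{\omega=\varphi} = \int p_\varphi^{i}(x)\, dx = 0. \qquad (188)$$

So the theorem follows from

$$\frac{\partial^2}{\partial \omega_i\, \partial \omega_j} \mathcal{H}_1(\omega;\omega')\Big|_{\omega'=\omega} = (\log e) \int \frac{p_\omega^{i}(x)\, p_\omega^{j}(x)}{p_\omega(x)}\, dx. \qquad (189)$$

If $K^*$ is finite for a class $\Lambda$ then the hypothesis of Corollary 3.1 is true and the code is minimax universal. The performance of this code is given by (179) and (180) with $K = k^2 K^*$. If $\Lambda$ is the class of Gaussian distributions with mean $\mu$ in a bounded interval and variance $\sigma^2 \in [\sigma_1^2,\sigma_2^2]$, $\sigma_1^2 > 0$, then $k$, the dimension of $\Lambda$, is 2, and computation of the integral gives

$$K^* = \max\{1, 2\sigma_1^{-2}\}. \qquad (190)$$

For the case where $\Lambda$ is the class of exponential distributions with mean $\beta \in [\beta_1,\beta_2]$, $\beta_1 > 0$, we have $k = 1$ and

$$K^* = \ldots \qquad (191)$$

More generally, if $\Lambda$ is the class of gamma distributions with $\alpha > 0$ fixed and $\beta \in [\beta_1,\beta_2]$, then

$$K^* = \ldots \qquad (192)$$

Next consider a class of mixture distributions. Let $\{q^i, i = 1,\ldots,k+1\}$ be a set of distributions on $\mathbb{R}$. Then define $\Lambda \subset \mathbb{R}^k$ by

$$\Lambda = \Big\{\theta : p_\theta(x) = \sum_{i=1}^{k+1} \theta_i q^i(x),\ \theta_j \ge \epsilon,\ j = 1,\ldots,k+1\Big\}. \qquad (193)$$

Since

$$p_\theta(x) \ge \epsilon\, q^i(x), \quad \text{i.e.,} \quad \frac{q^i(x)}{p_\theta(x)} \le \frac{1}{\epsilon},$$

we have

$$K^* \le \ldots \qquad (194)$$

Finally, consider the case where $\Lambda$ is a compact set of $k$-th order Gaussian autoregressive sources [30]. We assume here that $d_n(\cdot,\cdot)$ is the minimum of $n\bar{D}$ and the $r$-th power of the Euclidean distance. We also assume that if $f_n^*(\mathbf{x}) = \mathbf{a}_i$ then

$$d_n(\mathbf{x},\mathbf{a}_i) \le d_n(\mathbf{x},\mathbf{a}_j); \quad j = 1,2,\ldots,M. \qquad (195)$$

This means that each source output is mapped to the closest output level by the quantizer, which is a necessary condition for a quantizer to be optimal. In a Gaussian autoregressive source the output is generated by adding a Gaussian r.v. to a weighted sum of the previous outputs. So the $j$-th output is given by

$$X_j = -\sum_{i=1}^{k} a_i X_{j-i} + Z_j \qquad (196)$$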

where $Z_j \sim N(0,\sigma^2)$ are independent. We assume that the sources are asymptotically stationary and that $\sigma^2 \in [\sigma_1^2,\sigma_2^2]$. This is guaranteed if the zeros of

$$a(\lambda) = 1 + a_1 \lambda^{-1} + \ldots + a_k \lambda^{-k}$$

have magnitudes less than one. In vector form (196) becomes

$$\mathbf{X}_j = \Psi \mathbf{X}_{j-1} + \mathbf{Z}_j \qquad (197)$$

where $\mathbf{X}_j = (X_j,\ldots,X_{j-k+1})^T$, $\mathbf{Z}_j = (Z_j,0,\ldots,0)^T$, and

$$\Psi = \begin{pmatrix} -a_1 & -a_2 & \cdots & -a_{k-1} & -a_k \\ 1 & 0 & \cdots & 0 & 0 \\ \vdots & & \ddots & & \vdots \\ 0 & 0 & \cdots & 1 & 0 \end{pmatrix}. \qquad (198)$$

So the roots of $a(\lambda)$ are the eigenvalues of $\Psi$. A source $\theta \in \Lambda$ is determined by $(a_1,\ldots,a_k)$ and $\sigma^2$ so we write $\theta = (a_1,\ldots,a_k,\sigma^2)$. Each $\theta \in \Lambda$ must have $a_k \ne 0$; otherwise, it is not a $k$-th order autoregressive source.

The design procedure and the derivation of performance bounds are a little different here because the initial state of the source $(X_{-1},\ldots,X_{-k})$ is an issue. As in previous chapters we want the universal code to perform well for all initial states. Here however the initial state may lie anywhere in $\mathbb{R}^k$ so this is not possible. We must assume that the initial state lies in a compact subset of $\mathbb{R}^k$. Specifically we assume that $|X_{-j}| \le \xi < \infty$ for $j = 1,\ldots,k$. To construct the universal code we design codes for various sources in $\Lambda$ but only for a single fixed initial state $\mathbf{x}^0 = \mathbf{0} \triangleq (0,0,\ldots,0)$. We first consider how a code designed for source $\varphi$ with $\mathbf{x}^0 = \mathbf{0}$ performs when used for source $\theta$. If $\|\theta - \varphi\| \le \epsilon$ we have
$$\mathcal{H}_n(\theta;\varphi) \le \tfrac{1}{2} \epsilon^2 \log e\, [\sigma_1^{-4} + k^2 \sigma_1^{-2} \sigma_X^2] \qquad (199)$$

where

$$\sigma_X^2 \triangleq \sup_{\theta \in \Lambda} \lim_{n \to \infty} E_\theta[X_n^2]. \qquad (200)$$

Now $\sigma_X^2 < \infty$ since we assume that $\Lambda$ is compact and that the sources in $\Lambda$ are stationary. Details in the derivation of (199) are given in Appendix C. Given (199) we can bound the rate and distortion of the code as before.

So we have now constructed a universal code for initial state $\mathbf{x}^0 = \mathbf{0}$. We bound the performance of this code for other initial states as follows.

Given a vector of source outputs $\mathbf{X} = (X_0,\ldots,X_{n-1})$ from a source $\theta$ with initial state $\mathbf{x}^0 = (x_{-k},\ldots,x_{-1})$ we define a vector $\tilde{\mathbf{X}}$ by

$$\tilde{X}_i = X_i - \eta_i \qquad (201)$$

where

$$\eta_i = [\Psi^{i+1} \mathbf{x}^0]_1; \quad i = 0,\ldots,n-1. \qquad (202)$$

The matrix $\Psi$ is as in (198) for $\theta = (a_1,\ldots,a_k,\sigma^2)$, and $[\mathbf{x}]_j$ is the $j$-th component of the vector $\mathbf{x}$. Then $\tilde{\mathbf{X}}$ has the same distribution as a vector generated by source $\theta$ with $\mathbf{x}^0 = \mathbf{0}$. So we know

$$E_\theta[d_n(\tilde{\mathbf{X}},f_n^*(\tilde{\mathbf{X}}))] \le n\, \delta_{n,R}(\theta) + K' \qquad (203)$$

where

$$K' \triangleq \bar{D}(\sigma_1^{-4} + k^2 \sigma_1^{-2} \sigma_X^2)^{1/2}. \qquad (204)$$
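
The shift in (201) and (202) rests on linearity: subtracting the response of the recursion to the initial state alone leaves exactly the output that the same noise would have produced from the zero initial state. A small numerical check is sketched below. The coefficients, the noise values, and the ordering of the state vector as $(x_{-1},\ldots,x_{-k})$ are assumptions made for this illustration, as are the function names.

```python
def companion(a):
    """Companion matrix for X_j = -(a_1 X_{j-1} + ... + a_k X_{j-k}) + Z_j."""
    k = len(a)
    rows = [[-ai for ai in a]]
    for r in range(1, k):
        rows.append([1.0 if c == r - 1 else 0.0 for c in range(k)])
    return rows

def matvec(m, v):
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]

def simulate(a, noise, state):
    """Outputs X_0, X_1, ... driven by the given noise from the given initial state."""
    psi, state, out = companion(a), list(state), []
    for z in noise:
        state = matvec(psi, state)
        state[0] += z
        out.append(state[0])
    return out

def zero_input_response(a, state, n):
    """First component of Psi^(i+1) applied to the initial state, for i = 0..n-1."""
    psi, state, out = companion(a), list(state), []
    for _ in range(n):
        state = matvec(psi, state)
        out.append(state[0])
    return out

a = [-0.5, 0.06]                   # hypothetical coefficients; characteristic roots 0.2 and 0.3
noise = [1.0, -0.5, 0.25, 0.0]     # one fixed noise realization
x0 = [2.0, -1.0]                   # assumed ordering (x_{-1}, x_{-2})

x = simulate(a, noise, x0)
x_zero = simulate(a, noise, [0.0, 0.0])
eta = zero_input_response(a, x0, len(noise))
shifted = [xi - ei for xi, ei in zip(x, eta)]
print(shifted, x_zero)
```

Since the characteristic roots lie inside the unit circle, the correction terms decay geometrically, which is what allows their sum of squares to be bounded independently of $n$.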


From (195) we have

$$d_n(\mathbf{X},f_n^*(\mathbf{X})) \le d_n(\mathbf{X},f_n^*(\tilde{\mathbf{X}})). \qquad (205)$$

Let $\mathbf{X}^* = f_n^*(\tilde{\mathbf{X}})$. Then the expected distortion $\Delta$ (unnormalized) is bounded by

$$\Delta \le E_\theta[d_n(\mathbf{X},\mathbf{X}^*)]. \qquad (206)$$

Now

$$X_i - X_i^* = (X_i - \tilde{X}_i) + (\tilde{X}_i - [f_n^*(\tilde{\mathbf{X}})]_i)$$

so

$$\Big[\sum_{i=0}^{n-1} (X_i - X_i^*)^2\Big]^{1/2} \le \Big[\sum_{i=0}^{n-1} \eta_i^2\Big]^{1/2} + \Big[\sum_{i=0}^{n-1} (\tilde{X}_i - [f_n^*(\tilde{\mathbf{X}})]_i)^2\Big]^{1/2}.$$

Since the eigenvalues of $\Psi$ have magnitude strictly less than one, $\Psi^i \to 0$ as $i \to \infty$. So we may bound

$$\sum_{i=0}^{n-1} \eta_i^2 \le h \qquad (207)$$

where $h$ does not depend on $n$ or $\theta$. The details of this derivation are carried out in Appendix C. This gives

$$\Delta \le E_\theta\Big\{\Big[\Big(\sum_{i=0}^{n-1} (\tilde{X}_i - [f_n^*(\tilde{\mathbf{X}})]_i)^2\Big)^{1/2} + h^{1/2}\Big]^r\Big\}. \qquad (208)$$

If $r \ge 1$ and $a,b \ge 0$, then

$$(a + b)^r \le a^r + rb(a + b)^{r-1}$$

which implies

$$\Delta \le n\, \delta_{n,R}(\theta) + K' + rh(\bar{D} + h)^{r-1}. \qquad (209)$$

So, the distortion mismatch is bounded by

$$|D(f_n^*;\theta) - \delta_{n,R}(\theta)| \le n^{-1}[K' + rh(\bar{D} + h)^{r-1}]$$

where $K'$ is defined in (204), and from (180)

$$R(f_n^*;\theta) - R \le n^{-1}\Big[(k+1) \log n + 1 + \sum_{i=1}^{k} \log(b_i + n^{-1}) + \log(\sigma_2^2 - \sigma_1^2 + n^{-1})\Big]$$

where

$$b_i \triangleq \max_{\theta,\varphi \in \Lambda} |a_i - \tilde{a}_i|$$

using $\theta = (a_1,\ldots,a_k,\sigma^2)$ and $\varphi = (\tilde{a}_1,\ldots,\tilde{a}_k,\tilde{\sigma}^2)$. So the code is minimax universal. Notice that only $O(n^{-1})$ terms were added to the rate and distortion in going from a fixed initial state to an arbitrary initial state in some compact set. Again the additional distortion is $O(n^{-1})$ and the dominant term in the additional rate is the number of dimensions of the class $\Lambda$ times $n^{-1} \log n$.


3.3  Generalization  to  Unbounded  Distortion  Measures 

Under certain conditions we may remove the restriction that $d_n(\cdot,\cdot)$
is at most $n\bar D$, and still get minimax universal codes. In particular this
may be done if $d_n$ is a difference distortion measure which does not increase
exponentially and if the contribution of high distortion terms to the
expected distortion is small for any $\theta \in \Lambda$.

Let $B(\omega)$ be a sphere in $\mathbb{R}^n$ with diameter $\omega$ and define

$$\tilde d_n(\|x - \hat x\|) = n^{-1} d_n(x,\hat x)$$

where $\|\cdot\|$ is the Euclidean norm. If for all $\theta \in \Lambda$ we have

$$\int_{[B(\omega)]^c} \tilde d_n(\|x\|)\, p_\theta(x)\, dx \le f(\omega) \tag{210}$$
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where  f(co)  -+  0  as  u  +  *  and  in  addition 

d  *  0  (211) 

n 

as $\omega \to \infty$, then we may construct universal codes as before. We assume
that the quantizer $f_n^*$ has at least one output level in $B(\omega)$. If this is
not the case we may add one output level to $f_n^*$. The effect which this
has on the rate is small; $M$ is simply replaced by $M+1$ in (171), so the
dominant term is not affected. To bound the distortion we divide the outputs
into two sets. For the set $[B(\omega)]^c$ we know that the total expected
distortion is at most $f(\omega)$. For $x \in B(\omega)$ we have

$$n^{-1} d_n(x, f_n^*(x)) \le \tilde d_n(\omega) \tag{212}$$

so we may bound the distortion as before using $\tilde d_n(\omega)$ in place of $\bar D$. So
if we set $\omega = \log n$ then the distortion here is bounded by

D(f*;»)  <  n-1  d  (log  n)  K5  (log  e)'*5  +  f(log  n)  .  (213) 

n  n 

So from (211) it is clear that $f_n^*$ is minimax universal.

If $\tilde d_n(\omega)$ does not increase exponentially with $\omega$ then all classes
considered here (except for the mixture distributions) satisfy (210). For
example, if $d_n$ is the $r$-th power of the Euclidean distance and $p_\theta(x)$ decays
as $e^{-\alpha x}$ then

$$f(\omega) = k\,\omega^r e^{-\alpha\omega}$$

so that

$$f(\log n) = k\,n^{-\alpha}(\log n)^r \tag{214}$$

and

$$\tilde d_n(\log n) = (\log n)^r.$$
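As a small worked illustration of (214), the sketch below evaluates the tail bound $f(\omega) = k\,\omega^r e^{-\alpha\omega}$ at $\omega = \log n$ and compares it with the closed form $k\,n^{-\alpha}(\log n)^r$. The values `k = 1`, `r = 2`, `alpha = 1.5` and the use of natural logarithms are assumptions made only for this example.

```python
import math


def tail_bound(omega, k=1.0, r=2.0, alpha=1.5):
    """f(omega) = k * omega^r * exp(-alpha * omega); parameters are illustrative."""
    return k * omega ** r * math.exp(-alpha * omega)


def tail_bound_at_log_n(n, k=1.0, r=2.0, alpha=1.5):
    """Closed form of f(log n) with natural logarithms: k * n^(-alpha) * (log n)^r."""
    return k * n ** (-alpha) * math.log(n) ** r


# The tail contribution shrinks as the block length n grows.
values = [tail_bound(math.log(n)) for n in (10, 100, 1000, 10000)]
print(values)
```

With these illustrative parameters the polynomial factor $(\log n)^r$ is dominated by $n^{-\alpha}$, so the contribution of the region outside $B(\log n)$ vanishes as $n$ grows.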
APPENDIX A

PROOFS  OF  THEOREMS  FROM  CHAPTER  1 

Theorem 1.1. Given any two initial distributions $\mu_0$ and $\nu_0$ on $[0,1]$, if
$\mu_1$ and $\nu_1$ are generated from (38) then

$$\bar\rho(\mu_1,\nu_1) \le |\lambda|\,\bar\rho(\mu_0,\nu_0) \tag{A.1}$$

where $\lambda \triangleq 1 - \alpha - \beta$.

Proof. The proof is done in three stages. First we assume $\mu_0$ and $\nu_0$ are
concentrated on individual points, then finite sets of points, and finally
we generalize to arbitrary $\mu_0, \nu_0$.

Assume that $\nu_0$ is concentrated on a point $z^*$ and $\mu_0$ is concentrated
on $z^* + \epsilon$, where $z^* \in [0,1)$ and $\epsilon \in (0,1-z^*]$. This gives $\nu_0(\{z^*\}) = 1$,
$\mu_0(\{z^* + \epsilon\}) = 1$ and $\bar\rho(\mu_0,\nu_0) = \epsilon$. Now let $\mu_1$ and $\nu_1$ be the distributions
generated from $\mu_0$ and $\nu_0$ using (9). So

$$\mu_1[\{f_k(z^*+\epsilon)\}] = p(k|z^*+\epsilon) \tag{A.2}$$

and

$$\nu_1[\{f_k(z^*)\}] = p(k|z^*); \quad k \in A. \tag{A.3}$$

Since $\mu_1$ and $\nu_1$ are one-dimensional, $\bar\rho(\mu_1,\nu_1)$ is given by [5]

$$\bar\rho(\mu_1,\nu_1) = \int_0^1 |\mu_1[0,z] - \nu_1[0,z]|\,dz. \tag{A.4}$$

So the $\bar\rho$-distance is the area between the cumulative distributions for
$\mu_1$ and $\nu_1$. We first show that $\mu_1[0,z] - \nu_1[0,z]$ never changes sign. Define
a set $B(z,z^*) \subset A$ by

$$B(z,z^*) \triangleq \{k \in A : z \ge f_k(z^*)\}.$$


(A.5)

Then we have

$$\mu_1[0,z] = \sum_{k \in B(z,z^*+\epsilon)} p(k|z^*+\epsilon)$$

and

$$\nu_1[0,z] = \sum_{k \in B(z,z^*)} p(k|z^*).$$

Now $f_k(z)$ is of the form

$$f_k(z) = \frac{a_k z + b_k}{c_k z + d_k}$$

where $a_k$, $b_k$, $c_k$, and $d_k$ are constants, and its derivative is

$$f_k'(z) = \frac{a_k d_k - b_k c_k}{(c_k z + d_k)^2}$$

so $f_k(z)$ is monotonic. From the definition of $f_k$ (6) we have

$$a_k d_k - b_k c_k = \gamma_0(k)\gamma_1(k)\lambda,$$

so if $\lambda \ge 0$ $f_k(z)$ is increasing, and if $\lambda \le 0$ $f_k(z)$ is decreasing.
Assume that $\lambda \ge 0$. Then

$$f_k(z^*+\epsilon) \ge f_k(z^*)$$

for all $k \in A$ so

$$B(z,z^*+\epsilon) \subset B(z,z^*).$$

We now show that

$$\sum_{k \in B(z,z^*)} [p(k|z^*+\epsilon) - p(k|z^*)] \le 0$$


(A.6) 


(A.  7) 


(A. 8) 


(A.  9) 


(A. 10) 


for all $k \in A$.

Since $\lambda \ge 0$, (A.10) is equivalent to

$$\sum_{k \in B} [\gamma_0(k) - \gamma_1(k)] \ge 0. \tag{A.11}$$

(Here we use $B = B(z,z^*)$ for convenience.) If either $B$ or its complement
$B^c$ is empty, then clearly (A.11) holds with equality. So we may assume that
both $B$ and $B^c$ are non-empty. Suppose (A.11) is false. Then we have

$$\sum_{k \in B} \gamma_0(k) < \sum_{k \in B} \gamma_1(k)$$

and

$$\sum_{k \in B^c} \gamma_0(k) > \sum_{k \in B^c} \gamma_1(k),$$

which together imply

$$\sum_{k \in B^c} \gamma_0(k) \sum_{k \in B} \gamma_1(k) > \sum_{k \in B} \gamma_0(k) \sum_{k \in B^c} \gamma_1(k).$$

But if $k \in B$ and $j \in B^c$ then

$$f_j(z^*) > f_k(z^*)$$

(recall $B = B(z,z^*)$). So we have


(A. 12) 


(A.  13) 


(A.  14) 


$$\frac{\gamma_1(j)\eta_1(z^*)}{\sum_{i=0}^{1}\gamma_i(j)\eta_i(z^*)} > \frac{\gamma_1(k)\eta_1(z^*)}{\sum_{i=0}^{1}\gamma_i(k)\eta_i(z^*)} \tag{A.15}$$

which implies

$$\gamma_1(j)\gamma_0(k) > \gamma_1(k)\gamma_0(j). \tag{A.16}$$

If we sum both sides of (A.16) over all pairs $(k,j)$ such that $k \in B$ and
$j \in B^c$ we have

$$\sum_{j \in B^c}\gamma_1(j)\sum_{k \in B}\gamma_0(k) > \sum_{k \in B}\gamma_1(k)\sum_{j \in B^c}\gamma_0(j).$$

But this contradicts (A.13), hence (A.11) must be true.
So we have

$$\mu_1[0,z] = \sum_{k \in B(z,z^*+\epsilon)} p(k|z^*+\epsilon) \le \sum_{k \in B(z,z^*)} p(k|z^*+\epsilon) \le \sum_{k \in B(z,z^*)} p(k|z^*) = \nu_1[0,z]$$

for all $z \in [0,1]$ as desired.

For $\lambda \le 0$ we have

$$B(z,z^*) \subset B(z,z^*+\epsilon)$$

and since (A.11) still holds, it follows that

$$\sum_{k \in B(z,z^*)} [p(k|z^*+\epsilon) - p(k|z^*)] \ge 0$$

so in this case


(A.  17) 


(A.  18) 


(A.  19) 


(A. 20) 


$$\mu_1[0,z] \ge \nu_1[0,z]; \quad \forall z \in [0,1]. \tag{A.21}$$

In either case the absolute value in the definition of the $\bar\rho$-distance
(A.4) may be taken outside the integral, hence we have

$$\bar\rho(\mu_1,\nu_1) = \left|\int_0^1 (\mu_1[0,z] - \nu_1[0,z])\,dz\right|. \tag{A.22}$$

Now

$$\int_0^1 (1 - \mu_1[0,z])\,dz$$

is the expected value of $z$ under $\mu_1$, so

$$\bar\rho(\mu_1,\nu_1) = \Big|\sum_{k \in A}[p(k|z^*+\epsilon)f_k(z^*+\epsilon) - p(k|z^*)f_k(z^*)]\Big| = \Big|\sum_{k \in A}\gamma_1(k)[\eta_1(z^*+\epsilon) - \eta_1(z^*)]\Big| = |\lambda|\epsilon = |\lambda|\,\bar\rho(\mu_0,\nu_0). \tag{A.23}$$
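The point-mass case of Theorem 1.1 can be checked numerically. The sketch below assumes a two-state switch with $\eta_1(z) = (1-\beta)z + \alpha(1-z)$ and $\eta_0(z) = 1 - \eta_1(z)$, output probability $p(k|z) = \gamma_0(k)\eta_0(z) + \gamma_1(k)\eta_1(z)$, and update $f_k(z) = \gamma_1(k)\eta_1(z)/p(k|z)$. These forms are inferred from (A.15) and from the product $K$ in the proof of Theorem 1.5, since equations (6) and (8) are not reproduced in this appendix. All function names and the numerical parameters are hypothetical.

```python
def eta1(z, alpha, beta):
    """Assumed prior probability of switch state 1; linear in z with slope 1-alpha-beta."""
    return (1.0 - beta) * z + alpha * (1.0 - z)


def next_estimate_law(z, gamma0, gamma1, alpha, beta):
    """Atoms f_k(z) and probabilities p(k|z) of the next estimate, one per letter k."""
    e1 = eta1(z, alpha, beta)
    e0 = 1.0 - e1
    probs = [g0 * e0 + g1 * e1 for g0, g1 in zip(gamma0, gamma1)]
    atoms = [g1 * e1 / p for g1, p in zip(gamma1, probs)]
    return atoms, probs


def cdf_area(atoms_a, probs_a, atoms_b, probs_b):
    """Area between the two cumulative distributions, as in (A.4)."""
    points = sorted(set(atoms_a) | set(atoms_b))
    area = 0.0
    for left, right in zip(points, points[1:]):
        cdf_a = sum(p for x, p in zip(atoms_a, probs_a) if x <= left)
        cdf_b = sum(p for x, p in zip(atoms_b, probs_b) if x <= left)
        area += abs(cdf_a - cdf_b) * (right - left)
    return area


def one_step_distance(z_star, eps, gamma0, gamma1, alpha, beta):
    """Distance between the laws generated from point masses at z*+eps and z*."""
    mu = next_estimate_law(z_star + eps, gamma0, gamma1, alpha, beta)
    nu = next_estimate_law(z_star, gamma0, gamma1, alpha, beta)
    return cdf_area(mu[0], mu[1], nu[0], nu[1])


# Illustrative binary-alphabet subsources and switch parameters.
g0, g1 = [0.3, 0.7], [0.8, 0.2]
print(one_step_distance(0.3, 0.2, g0, g1, alpha=0.1, beta=0.2))  # lambda = 0.7
print(one_step_distance(0.3, 0.2, g0, g1, alpha=0.9, beta=0.8))  # lambda = -0.7
```

In both cases the computed area matches $|\lambda|\epsilon = 0.14$ up to rounding, in agreement with (A.23).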

If  and  vq  are  concentrated  on  a  finite  set  of  points  the  result 
generalizes  as  follows.  Let  P  be  the  class  of  distributions  on  [0,1]  which 
are  concentrated  on  a  finite  set  of  points.  Let  or*  be  the  joint  distribution 
which  achieves  p0aq,Vq).  (Since  p.g  and  Vq  are  one-dimensional  a*  is  easily 
determined.)  Now  or*  is  also  concentrated  on  a  finite  number  of  points,  say  N, 
so  for  some  set  £(x^,yi)]^-^  we  have 


«*({(xi,yi)})-ei  i  0  (A. 24) 

N 

and  E  9^  *  1.  Define  probability  measures  and  i  “  1,2 . N 

by  1-1 

^ol)({xi})  "  1 

vji>({y1)  )  -  1 


ofQ1  ^  (t  (xi»yi)»  * 1  • 


(A. 25) 
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Then  let  end  be  generated  using  (9).  Let  cr^  be  the  joint 
distribution  with  marginals  and  which  achieves  p 

For  each  i  (10)  and  (A. 23)  imply 

E  a^ix-y|]^  1^1  *  l*£  -  y^l  <A-26> 

*1 

since  and  are  concentrated  on  single  points.  Further 

A  N  m 

a.  ■  Z  0  '  is  a  joint  distribution  with  marginals  n.,  and  v.,  hence 

1  i.l  1  1 

N 

pGi1,v1)£E(V  [|x-yj]  -  Z  Q^E  (i)[|x-y|] 

1  i"l 

N 

-  z  ejANvyJ 

i-1  x 

-  l^l^t  i 

*  |x|p(lt0*v0)  •  (A. 27) 

Now  consider  arbitrary  distributions  and  vq>  Define  a  sequence  of 
distributions  Hq  for  N  ■  1,2,...  by 

nj[0,x]  -  jN"1  if  v.0 [0,x]  €  ((j-l)N-1, jN-1]  ;  j  -  1,...,N. 

Then 

p(^,H0)  -  J*  1p.q[0,x]  -n0[0,x]|dx 


N 

-  z 

x« 

rJ 

|*Aq[0,x]  ■  p>q[0,x]  |dx 

J-l  ' 

-j . 

■1 

N 

<  Z 

X, 

J  J 

IjN"1-  (j-l)H"1|dx  -  N-1 

j-l 

XJ- 

•1 
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i  I- 

;  r 
1 1 
!  r 


where  Xj  is  such  Chet  |±g[0,x]  5  JN~*  for  x  <  x^  end  |*q[0,x]  >  jH”1- 
for  x  >  Xj  . 

N  N 

Now  if  ^  end  p,^  ere  genereted  from  p^  end  p.g  using  (9)  we  have 

KtO,x]-MO.x]|  -  E  J  .  palsHki^d*)-^  (dz))  . 

i6A  f^i([0,x]) 


(A. 28) 


Since  f^  is  monotonic  f^1 ([0,x])  is  en  interval,  say  [w^y^,  and  using 
integration  by  parts 

y  y 

J*  SdlOfcJcd*) -,ig<dz))-p(i|z)<u]JtO,zl  -  n.0t°,*l>!  1 
w  w, 

i  i 

y 

+  J  1CnJ[0,z]  “P0tO,z])p'  (i  |z)dz  . 

Wi 

Now  £  p(i|z)  -  X[Yl(i)  -Y0(i)1  ^  1  so 

y 

J*  lp(i|z)(l*Q(d*))  S  M'otO,w1l  -pkglO^] +pig[0,yi] 


+  J  -^0[0,z])dz 


*  2N"1  +  p  (^g^g) 
£  3N*1  . 


It  follows  from  (A. 28)  that 


|^[0,x]  - p^[0,x] |  S  6N_1 


p^*,^)  £  J*  6N-1dx  -  6N"1  . 

If  we  define  similarly  we  heve  p(v^,v^)  £  6N-1.  We  know 

p^.v^JS  |x|pC*q,vq) 


(A. 29) 


I 


so  by  Ch«  triangle  inequality. 


p(H1,v1)  ^  |x|pV0.v"0>  +  12N  1 

£  Ix|[pC*0,v0)  +  2N‘l]  +  12N"1 
<  IxIp^q.Vq)  +  14N  1  . 


Since (A.30) holds for all $N$ we have

$$\bar\rho(\mu_1,\nu_1) \le |\lambda|\,\bar\rho(\mu_0,\nu_0)$$

for arbitrary distributions $\mu_0,\nu_0$ on $[0,1]$.

Theorem 1.4.

$$F_i(z_0,\hat z_0) = \sum_{x^i \in A^i} |z_i(x^i) - \hat z_i(x^i)|\,p(x^i|z_0) \le |\lambda|^i |z_0 - \hat z_0| \tag{A.31}$$

where $z_i(x^i)$ and $\hat z_i(x^i)$ are derived from initial estimates $z_0$ and $\hat z_0$ using
the recursion (6) $i$ times.

Proof. We will use induction on $i$. For $i = 1$

$$F_1(z_0,\hat z_0) = \sum_{j \in A} |f_j(z_0) - f_j(\hat z_0)|\,p(j|z_0) \tag{A.32}$$

where $f_j$ and $p(j|\cdot)$ are as defined in (6) and (8). Assume $\lambda(z_0 - \hat z_0) \ge 0$.
Then $f_j(z_0) \ge f_j(\hat z_0)$ so we may remove the absolute value brackets in (A.32).
Now $\sum_{j \in A} f_j(\hat z_0)p(j|z_0)$ is the expected value of a distribution which assigns
probability $p(j|z_0)$ to the point $f_j(\hat z_0)$; $j \in A$. Let $\mu$ be the probability
measure for this distribution. Then
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M-tO.al  -  E  P(j|z0> 

j  6  B(z,zq) 

*  v[0,z]  ^  E  A  P(j|*0) 

J  €  B(z,z0) 

where  B(z,zQ)  is  as  defined  in  (A. 5)  and  the  inequality  follows  from  (A.20). 


Next 


J  v[0,z]dz  -  E  f j (z-)p(j | *q) 
0  j  €  A  J 


so 


*i<vV  “  j  | 

“  l^(zo  *  zo^  * 


(A. 33) 


The same result follows if $\lambda(z_0 - \hat z_0) < 0$ using corresponding inequalities.
So the result holds for $i = 1$.

Now we assume $F_j(z_0,\hat z_0) \le |\lambda|^j |z_0 - \hat z_0|$ for $j = i$.

Then


ri+i<VV  •  ^+llH+i(sl+1)‘8i+i<il+1)|p(£l+l|so> 


z  f  z  |£%  (^(s1))  -  /*1(s1»|p(*1+1K<2‘1)>|  p(s1IV 

111 

5  l 


xi+l  1+1 


i+1 

i  \  I  i 


1»j  p(sllv 


SI  IxM^CS1)  “*1<51>|p<£  l*0) 


(A.  34) 


-  |xl-Fi<vV 

£  Wi+1lzo'2ol 


where (A.34) follows from (A.33).
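The bound (A.31) can be checked by brute force for short sequences. The sketch below enumerates every output string of length $i$ over a binary alphabet and accumulates $|z_i - \hat z_i|$ weighted by $p(x^i|z_0)$. It assumes the same forms of $\eta_1$, $p(k|z)$ and $f_k(z)$ as inferred from (A.15), because the recursion (6) itself is not reproduced here; the parameter values and function names are hypothetical.

```python
from itertools import product


def eta1(z, alpha, beta):
    """Assumed prior probability of switch state 1 given the current estimate z."""
    return (1.0 - beta) * z + alpha * (1.0 - z)


def p_out(k, z, gamma0, gamma1, alpha, beta):
    """Assumed probability of output letter k given estimate z."""
    e1 = eta1(z, alpha, beta)
    return gamma0[k] * (1.0 - e1) + gamma1[k] * e1


def f(k, z, gamma0, gamma1, alpha, beta):
    """Assumed estimate update after observing letter k."""
    return gamma1[k] * eta1(z, alpha, beta) / p_out(k, z, gamma0, gamma1, alpha, beta)


def mismatch(i, z0, z0_hat, gamma0, gamma1, alpha, beta):
    """F_i(z0, z0_hat): expected gap between the two estimate sequences after i steps."""
    total = 0.0
    for seq in product(range(len(gamma0)), repeat=i):
        z, z_hat, prob = z0, z0_hat, 1.0
        for k in seq:
            # The output law is driven by the estimate that started from z0.
            prob *= p_out(k, z, gamma0, gamma1, alpha, beta)
            z = f(k, z, gamma0, gamma1, alpha, beta)
            z_hat = f(k, z_hat, gamma0, gamma1, alpha, beta)
        total += abs(z - z_hat) * prob
    return total


g0, g1, alpha, beta = [0.3, 0.7], [0.8, 0.2], 0.1, 0.2
lam = 1.0 - alpha - beta
for i in range(1, 6):
    print(i, mismatch(i, 0.5, 0.3, g0, g1, alpha, beta), abs(lam) ** i * 0.2)
```

Each printed mismatch sits below the corresponding geometric bound $|\lambda|^i |z_0 - \hat z_0|$, so the effect of the initial estimate dies out exponentially.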
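Theorem 1.5. If $\theta, \varphi \in \Lambda'(\delta)$ then

$$F_i(z_0,\hat z_0) \triangleq \sum_{x^i} |z_i(x^i) - \hat z_i(x^i)|\,p_\theta(x^i|z_0) \le |\lambda_\theta|^i + K_\epsilon[1 - |\lambda_\theta|]^{-1} \tag{A.35}$$

where

$$K_\epsilon \triangleq \delta^{-2}[3\epsilon + 3\epsilon^2 + \epsilon^3] + \epsilon, \tag{A.36}$$

$p_\theta(x^i|z_0)$ is the probability that $x^i$ is output from source $\theta$ if the initial
estimate is $z_0$, and $\lambda_\theta \triangleq 1 - \alpha - \beta$.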


A 

where  7)^  *  and  7)^  is  defined  as  T| ^ ^  using  the  parameters  for 

source  cp.  Next,  if  we  define 

K  -  ((l-8)z  +  a(l-s)M8'z+(l-a,)(l-z)] 

and 

K'  -  (e*  +  (l-a)(l-«)].[(l-p)»*+B'(i-,)] 


then  K-K'  S«.  So  since  -  Y^^O)*  and  T^T)^  -  Yi<J>Y0(j>K' 

we  have 

^JlWV  “  l<vi(J)+«)CY0(J)+c)(K'+c) -Y^(j)Y0(j)K‘| 

S  3e+3e2+e3  . 


(A.  38) 
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So  we  have  fj(z)-fj(z)^  6  [3e+3e  +e  ]  and 

MV'V-  s|*j<V  -ij<*0>l»Ol*o>  +  * 

s  s I £j («0)  -  * j (*0) I p  ' t J I ■ *0)  +s  I 1 (*„>  -  5j  <*o> I ' e 0 1 > *0> +  « 

S  *1*0  "  *ol  +K« 

s  lx, |  +  K8  • 

Equation  (A. 39)  follows  from  Theorem  1.4.  Next 


Fi+l(z0,z0)  "  2  2  fx  (zi^  ‘  fx  (2i(-  5  p(xi+l  Zi&  ))f 

i+i  u  u  t  xi+1  i  xi+1  1  XTl  1 

jLi+i  J 


so  from  (A.39)-(A.40) 


WV*o>£  I ’■el  ?  l^l(it)-Si(3s1)|?(£iIV+K. 


and  we  solve  the  recursion  (A. 41)  with  initial  condition  (A. 40)  to  get 


h<‘o-lo>s  l\,l1+V1-lxel  5-1  • 


(A.  39) 
(A. 40) 

?(slIV 


(A.4i.) 


(A. 42) 



toram  1.6.  If  0  €  A(t),  6  -  (Q,(P(x)])  then 


Dn  4  ||£T(xn)[x'f(xn)l]'1  -  ft(xn)tft(xa)U"1l|  S  r1  (A. 43) 

for  any  probability  row  vectors  z  and  £,  where  Q  is  defined  in  (65)  and 
xn  -  (xlt...,xQ). 

Proof.  Throughout  this  proof  xn  will  be  a  fixed  vector  of  source  outputs 
and  x"  will  be  used  to  denote  the  first  i  components  of  xn  for  1  £  n.  Define 
a(j)  -  T(xJ)l  end  let  be  the  set  of  columns  of  T(xJ).  Then 


D 

n 


max 

Cn 


£.U 

**St(n) 


A 

i»£(n) 


(A.44) 


£ 


max 
u  €  C 


[. 


u. 


max 


€  V. 


<Jt(n) 


-  min 
i  6  V. 


at(n) 


(A.45) 


where 

vn  6  Ci:c1(n)  >  0},  (A.46) 

since  and  (^(n)  are  non-negative.  We  will  prove  &n  ^  Qn~l  by 

induction.  First  4^  5  1  since  u^  2  0  and  o^(l)  2  u^  for  ti  €  C^.  Now  assume 
<,  Q?  Note  that 

lQP(xi)u]t  u| 

[QPCx^aCJ-l)^  "  S“(j)  (A'47> 

where  u  €  Cj^  and  u'  €  Cj.  Also  if  we  define 


Vj  -  Ci:[P(xJ)2(j-l)]1  >  0J 


(A. 48) 
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Vj  c  Vj,  and  since  p<x^|l)  >0  Implies 

[PCx^u^ 


[POtjteO-Dli  ffjO-l) 


we  have 


f  [P(Xj)u]t  [P(*j)u]1  "J 

Let  w  «  P(Xj)g(j-l)  and  jr  ^  P(Xj)u  for  some  fixed  x  €  Cj_i*  Let 
a(u)  ^  max„  [y  /w  ]  and  b(u)  «  min  [y./v. ].  Also  define 

J  f  «  11  J  /■  i  I* 


*  lQ  2]i 

fi(u)  -  max  ■-■■ — L— 

i  IQ  *lt 

-  tQ  i\t 

b(u)  *  min  7— — s—  . 

i  IQ  wlt 


Then  if  W  -  {(k.mj-.y^  >  y^} 


^•isisa 'mxc- 


(A. 49) 


(A. 50) 


(A. 51) 

i(u)  -£(u) 


£  q(ijk)q(i|»)(ykwn  -  yawk) 

k«® 


rj  t  q(i|k)q(x|n)wkwn 
*  k,n 


£  tq(i|k)q(i|»)  -q(i|»)qW|k)Uykwm-ymwk) 

€WtqU|k}q06|»)  +  q(i|m)qU|k)]vv 
l,A  <k,m)€W 


t  q  (i  |  k)q  (i  |  ®)  -  q  (i  |  »)q  Ce  |  k)  ]  (ykwB  -  yavk) 


(ykWm  ~  ymWk) 
w  w. 


(l-(S-l)t)  +«  (k,m)  €  W  m  k 


From  (A. 46)  and  (A. 51)  ve  have 


A ,  ■  ®«x  (S(u)  -£(u)) 

—  €  Cj-1 

<,  max  [;(«(«)  -b(u)] 

*€Cj-l 


S  C  A- 


J-l 


<  L 


J-l 


ae  desired. 


(A. 52) 


(A. 53) 
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Th«nr»«  1.7.  If  6,<p  €  A1  (5),  and  the  ptrmttri  for  8  and  <p  ara  within  a. 


than  if  X  la  the  coda  for  9  we  have 
n 


r  (1  ,8)  ^  Kn  +  Kt  . 
n  a 


Proof,  For  convenience  let  qQ^jag)  ■  P-Cs^Isq)  and  pCx^Sq)  *  PqCx^Xq) 


in  thia  proof.  Than 


r  (X  ,0)  -  n'^l  +  E  p(x°|a0)C  min  [f-log  q(xa|k)l}]  +  log  p(x“|*0)} 


x*  k-0,1 


n-1  p(x..Jzi(xi)) 
£  n‘l  2  +  I„  p(H,‘|,0)  I  log  11  t- 
x  i-o  q(xi+xl*i(x  )) 


(A. 54 


where  ^(x1)  la  the  estimate  derived  from  outputs  x1  with  zQ  -  0.  Now  since 


log  x5  (x-l)log  e  and  p^Ug)  "  E  P(xnlz0) 

xi+l» * * *  ’xn 

we  have 

1  £+t 

r  (X  ,0)  ^  n"  [2  +  E  6  log  e  E,  . ,  p(x  |*0)‘ 


tPCXi+lKCS1))  ■  q^xi+ll2iC*i))U  •  (A-55 


|p(x|«)-q(x|i)|  ;S  |p(xjs)-p(x|S)|  + |p(x|x)-q(x|£)  | 

-  1y0(x)-Y1<x>M*-2|  +hxl(i)+\0(«)  -T)xl(2)  -\o<£)I 


(A. 56 


where  T|  .  is  defined  in  (A. 37). 
$$nR_n(\theta) \ge H_n(X) \ge nR_n(\theta) - 1. \tag{A.63}$$

Let $Z_0$ be the initial state of the switching process and let $T$ be the
first switching time (we set $T = n$ if the first switch occurs after time $n$).
Then

$$n\mathcal{H}_c(\theta) = H_n(X|X_0,\ldots) \ge H_n(X|Z_0,T,X_0,\ldots) = H_n(X|Z_0,T) \tag{A.64}$$

since $X_i$, $i \ge 1$, is independent of $(X_0,X_{-1},\ldots)$ given $Z_0$. Next

Hn&lZ0’T)  "  El"  Z  T  Yz  (*)log  Yz  (*)  -  Z  a..jP(*lz*>log  P(*|**)l 
x  €  A  0  0  x  €  A 


(A. 65) 


where  the  expectation  is  over  Zq  and  T,  and 

y„(x)  A  n  y Ax  ) 

Z  i-l  i  x 


n  p[x.  •  x.  jz0  - 

i-i  1  1  v 


(A. 66) 


for  x  €  A  .  We  know 


Hn-k<2>  4  Hn^>  '  k  lo*  J 

since the source alphabet has $J$ letters. Since the second sum of (A.65) is
the expectation over $T$ of $H_{n-T}(X)$ and since the first sum is positive we
have
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Ha(£l Z0,T)  *  E[Htt.T(X)] 

*  Hq(X)  -  E[T]log  J 

^  Hn(X)  -  j  l°g  J  .  (A. 67) 

So  from  (A. 61)  and  (A. 64)  we  have 

Hq(X)>  n3Cc(0)  *  Hn(X)  -  £  log  J  (A. 68) 

hence  from  (A. 63) 

Rn(0)  -Kc(0)  <  n_1(l  +  i  log  J)  (A. 69) 

as desired.

APPENDIX B

PROOF OF REDUNDANCY BOUND FOR THE VL-FL CODE OF CHAPTER 2

In  [13]  •  set  9  -  [<p. ;  t  -  }  i«  constructed  with 

n  i  n 

Yn  -  S  exp2[s£r^(J-l)log  9jnl  +  ft  log  jl  +j}]  (B.l) 

such  thet  if  6  €  A  hss  transition  probabilities  pQ  and  initial  state  sQ 
then  there  exists  a  cp  €  9q  with  transition  probabilities  p^  and  initial  state 
Sq  such  that 


A  Pe(*|J)  .i 

*»<»*■  VJ)  *  x  *  Ape(x|J)1°8  rsur s  “ 


for  all  j  -  1,2, . . .  ,S .  Further,  if  cp  is  in  9q  then 


CB.2) 


ninCPjpCxIj)^  j  €  J,  x  €  A]  -  . 


Let  9q  be  this  set.  From  (152),  if  nin-f  flog  yl  then 


(B.3) 


R-crt.e)  <  in+riogYnli‘ti0(rn)]"1 


(B.4) 


where  r  is  the  code  for  a  and  3C  (p_,p  ;J)  <>  n”  for  j  ■  1,2,  ...,S.  For. 
n  r  <3  cp 

convenience  denote  p.  by  v  and  p  by  Tl.  Now  2  n  is  the  average  probability 

“  cp 

of  the  strings  in  r  ,  so  by  (148) 

•n(x)/2"n  S  -  S  9Jn  (B.5) 

at 

where  at  is  the  minioum  transition  probability  of  cp.  So  we  have 


n»  Z  v(x){log[T)(x)/2“  ]  +  log[v(x)/71(x)l  -  log  v(x)} 

x  €  r 

—  n 


S  2  v(x)C-log  v(x)  +  log[v(x)/7)(x)]}  +  log  9Jn  . 

x  €  r 


(B.6) 


•k.  m 

• 

*  '  ■ 

•  .r  "* 

“:rr-  • 

-  .-*1 
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Next  define 


A  A(x)  v(x. Is.  .) 

■  2  v(x)log[v(x)/T1(x)]  -  Z  v(x)  Z  log  \ 

x  €  r  x  €  r  i*i 

"  n  "•  n 


L 

•  Z  Z  ,  ,  Z 


v<xil8i-l> 


2  i.x  2  v(x)log  r-7-^p  ~ 
i-1  aCA*  1  x€B(a)  T1Cl|8i-l)  (B*7) 


where  sQ  -  j,  et  -  {(x^j.i^j),  L  -  maxU(x):  x  €  Tn}  and 
B(«)  A  (x  €  Tq:  x^c^,  1  <  k  <£  1-1,  A(x)  *  i}. 
Note  BQj)  «  rn  for  1*1.  Next,  since 


(B.8) 


t(*) 

v(x)  -  v(a)  II  v(xkjsk-  )  ,  (B.9) 

k-i 


if  we  split  £  into  Z  Z  we  have 

x  €  B(a)  P  €  A  x  €  B(a) 

P 


n 


L 

Z 

i-1 


Z  .  .v(i>  Z  v(P|s  )log 
a  €  Aw  p  €  A 


vCPis^i) 

TKPl®i-l> 


X(x) 

z  n  v(x.  | s.  .) 

x  €  B(or)  k-i+1 


xt-P 


(B.10) 


Now  in  an  encoding  tree  the  total  probability  of  the  leaves  which  may 
be  reached  by  passing  through  a  given  node  is  equal  to  the  probability 
of  that  node.  The  innermost  sum  is  the  total  probability  of  the  paths 
from  a  node  to  all  of  its  leaves,  which  is  the  total  probability  of  the 
leaves  divided  by  the  probability  of  the  node.  Hence  this  sum  is  one 
unless  it  is  empty.  Further,  since  each  non-terminal  node  has  J  successors 
(one  for  each  x  €  A)  if  no  x  €  B(or)  exists  such  that  x^  -  P  then  all 
x  €  B(at)  must  be  of  length  1*1.  So  if  the  innermost  sum  is  zero,  then 




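The encoding-tree property used above (the leaves reachable through a node carry exactly the probability of that node) can be illustrated on a small complete tree. This sketch assumes a binary memoryless source, which is the single-state special case of a unifilar Markov source; the letter probabilities and the particular set of leaves are made up for illustration and are not the code constructed in Chapter 2.

```python
import math


def string_prob(s, letter_probs):
    """Probability of a string under a memoryless source (assumed for this sketch)."""
    p = 1.0
    for ch in s:
        p *= letter_probs[ch]
    return p


def leaf_mass_below(node, leaves, letter_probs):
    """Total probability of the leaves whose path passes through `node`."""
    return sum(string_prob(x, letter_probs) for x in leaves if x.startswith(node))


letter_probs = {"0": 0.7, "1": 0.3}
leaves = ["000", "001", "01", "1"]   # a complete tree: every internal node has 2 children
internal_nodes = ["", "0", "00"]

# Leaf mass below each node divided by the probability of the node itself.
ratios = [leaf_mass_below(v, leaves, letter_probs) / string_prob(v, letter_probs)
          for v in internal_nodes]
print(ratios)
```

Each ratio equals one up to rounding, which is why the innermost sum in (B.10) is one whenever it is non-empty.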
So  from  (B.4),  (B.15),  and  (B.16) 

„  x,-l  -1  (1°8  9Jn+rlog  YT)0Ca  .0)+n-1) 

n[i9<rn)i  ^3ccrn.e)+n  l+ - -.-lo;9Jn  ° - 


(B.17) 


Define the entropy of the set of strings $\Gamma$ as

$$\mathcal{H}(\Gamma,\theta) = -\sum_{x \in \Gamma} \nu(x)\log\nu(x). \tag{B.18}$$

Since  the  encoding  tree  for  r  Is  a  subset  of  the  tree  for  we  have 

Q  n 


3C<rs,9)  a  K(Tn, 6)  . 


(B.19) 


Further 


—  — 


W  4  W 


therefore 


SUg^n)]'1  -3C(Tn,0)  -  [£ 


a  [5 


n  n 


(B.20) 


From (B.1) we have

$$\lceil \log\gamma_n \rceil \le \tfrac{1}{2} S(J-1)\log n + C \tag{B.21}$$

where

$$C \triangleq \log S + \tfrac{1}{2} S(J-1)\log 9J + 2S\log J + S(J+2) + 1 \tag{B.22}$$

so we may rewrite (B.17) as
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rA(Ta,9)  £3C(T 
n  n 


.•e>  [ 


%  S  (J-l)logn  4-  logn 
n  -  log  9Jn 


■] 


Q  -  log  9Ju 


(B.23) 


CQ  A  (log  9J  +  C)K(rn,e)  +  2  +  n“1(%  S(J-l)logn  +  C)  .  (B.24) 

Now  if  we  follow  Che  steps  la  (B.7)-(B.ll)  but  with  O'  ■ -E  v(x)log  v (x)  in 
place  of  n  we  get 


L 

Q'  -  -2  2  v(«)f(a)  2  vGls^log  vtfls^)  (B.25) 

i«l  a€A  P  €  A 


and  since  the  innermost  sum  is  less  than  or  equal  to  log  J 


K(rQ,9)  -  o' tle (r^]"1^  1°8  J  • 

(B.26) 

From  (B.21) 

we  have 

CnSK 

-  (log  9J+C)(logJ)  +  2  +  k  S(J-l)  +  C  . 

(B.27) 

Let 

K  ^  \  S(J-l)  +  1  . 

(B.28) 

Then 

n  <  n  +  (K  -  l)log  n  +  C 

(B.29) 

and 

<  K  leg  J»log  n  +  K 

nv  n'  a  *  ,  a  ^ . 

n  -  K  log  n  -  C1 

(B.30) 

where 

C'  ^  C  +  log9J  . 

(B.31) 

Then  if  we  define 


K(n) 


«“1 

n 


[K  log  n  +  C'] 


(B.32) 
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we  have 


r-crj)  £  aml[i  log  J.log  a  + 


Kll  E 

i-0 


[K<a)]] 


So  If 

a  i  2(K2  +  C'Klog  n)2 

we  have 

«-<rJ)  ^  n’^og  J[%S(J-l)log  n  +  log  n]+S‘1(2K  +  l)  . 


CB.33) 


(B.34) 


(B.35) 


as  desired. 

If $\Lambda$ is the class of binary memoryless sources we may eliminate the $\log n$
term from (B.35). We let $\Phi_n$ be defined as $\{\varphi_i : i = 1,\ldots,\gamma_n\}$ where


( 2lV 


*4 


,  for  1  <  1  ^  %  y„ 


(B.36) 


1-2(Yn"  i-+1)2Y^2  ,  for  %  Yn  <  i  ^  Yn 


and  Y_  *  lA/rv|  .  Then  It  is  easily  shown,  ([12]  equations  (18)-(20)),  that  for 


-1 


9  ^  .5  If  0  €  tcP^>cPt+il  then  ^r(®*^l+l>  ~  *  We  ““y  replace  log  9jn  In 

(B.6)-(B.23)  with  log  q}"1,  where  0  €  [cp^.cp^]  and  (B.23)  becomes 


r-(T»)  £  H(0) 
n  n 


k  log  n  +  log  9^ 


-  log  cpt 


P1] 


n  -  log  qj’1 


(B.37) 


But  for  9  €  w*  have 

(log  (p’1) H(9)  £  log  9“l  H(9) 

<  1.69  .  (B.38) 

By  the  same  steps  ss  (B.30)-(B.33)  this  Implies 

r»(T*)  S  n"^  log  n  +  K,n 
n  n  j 

where  ^  2(K+1.69),  for 

n*  (M  log  S  +  1.69  +  K) (6+2  log  n). 


For  9  >  .5  the  same  bound  holds  since  i  Is  symmetric  about  9 


APPENDIX C


PROOFS AND DERIVATIONS FOR CHAPTER 3

Theorem 3.1. If $f_n$ is a code and

$$\mathcal{H}_n(\theta;\varphi) + \mathcal{H}_n(\varphi;\theta) \le \xi$$

then

$$|D(f_n;\theta) - D(f_n;\varphi)| \le \bar D\,\xi^{1/2}(2\log e)^{-1/2}.$$


Proof  of  Theorem.  Let  J  be  a  positive  Integer  and  define 

Am  -  (  x  €  n"1da(x,fn(x))  € 

If  we  define  h(x)  ^  pfl(x)  -p  (x)»  J'  ^  Tn  5  Jl  ,  and  H  ^  (x:h(x)  Si  0} 

9  Cp 

we  have 

J' 

n|D(fQ;e)  -D(fn;qJ)|  -  1  2  JK  da(x.fn(s»h(x)dx| 

h(x)dx  +  J-1  JA  nHh(x)dx| 
m  m 

$k  h(x>d£  +  J’1  J*H  hC*)d*l 

in 

J*^hC*)dx|  +  J-1 


m-i 


J' 

£  |  £  mJ 

of! 


-1 


J' 

|  £  nvJ 
dfI 


-1 


J’  -1 
<  J  £  mJ 

ofI 


J’ 

I* 

bof! 


where  nQ(B)  ^  J  Pg<5)d*  f°r  B  c 


«i'l(u9(A.)  -  »,(*„)]  I  +  j 

mn. 




For ease of notation let $p_m \triangleq \mu_\theta(A_m)$ and $q_m \triangleq \mu_\varphi(A_m)$ for $m = 1,2,\ldots,J'$. By
definition (177)

$$\mathcal{H}_n(\theta;\varphi) \ge \sum_{i=1}^{J'} p_i \log\frac{p_i}{q_i}$$

so the problem reduces to finding


$$\max_{\{p_i,q_i\}} g(p,q) = \max_{\{p_i,q_i\}} \sum_{m=1}^{J'} mJ^{-1}(p_m - q_m) \tag{C.6}$$

subject to the following constraints:

i) $\mathcal{H}^*(p,q) \triangleq \sum_{i=1}^{J'} p_i \log\frac{p_i}{q_i} + \sum_{i=1}^{J'} q_i \log\frac{q_i}{p_i} \le \xi$

ii) $\sum_i p_i = \sum_i q_i = 1$

iii) $p_i \ge 0$ and $q_i \ge 0$ $\forall i$.

Since $\max g(p,q) = -\min g(p,q)$ by symmetry, we may bound (C.5) by
bounding (C.6). From i) we see that

$$p_i = 0 \iff q_i = 0.$$


$(p»q»X,X_,X  ,v_  ,v  ) -g(p,q) +X  3C»(p,q) 

r  H  r§  •  Hp  • 

+  x_  £  p«+x„  E  q<  +E  v  .  p4  v„,q4 
p  i  1  q  i  n  i  Pl  1  i  ql  1 

where  X,Xp»Xq  are  Lagrange  multipliers  and  vpt  and  vql  are  *ero  If  pt 
and  q^  are  positive,  and  positive  If  p^-q^-0.  We  must  have 

-  0  and  -  0  at  the  maximizing  (p,  ,q. } .  Let  W  be  the  subset  of 
bPt  oq^t  1  * 

(1,2,...,J'}  such  that  1  €  W  implies  and  q^  are  positive.  Then  for 
1  €  W  we  have 


(C.7) 


sc*- 
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and 


The**  Imply 

r  qi  pii 

•I***-#*  w# 

which  Is  equivalent  to 


Pt  \  « 

—  +  —  -  K 
qi  Pi 


(C.8) 

(C.9) 


(C.10) 


where  K  is  independent  of  i.  Now  (C.10)  can  have  only  two  solutions  for 

pi  -1 

— ,  some  T)  and  j\  .  But  from  (C.8) 


log  —  +  1  -  -~- 


<L  -U  -  X 
i - R 


(c.ii) 


The  left-hand  side  is  a  function  of  —Whence  it  has  at  most  two  distinct 
values,  but  the  right-hand  side  is  different  for  all  1.  So  we  must  have 
only  two  elements  in  Wi  that  is,  only  two  pairs,  say  (Pa»4a)  sod  (p^.q^), 

may  be  non-zero.  Now  ■  1  -p.  and  q  ■  1  -q.  from  11)  and  from  (C.ll) 

pa  qb  ° 

w*  have  — •  ■  —  so 

qa  pb 


p*  -  qa  “  2P.  '  l- 


(C.12) 


The  values  of  a  and  b  do  not  affect  the  relative  entropy  constraint  1),  but 

g(p.q)  ■  J’1t»(pa-qa)  +  b(pb-qb)] 

Ip^q*}  tp^qj} 

-  max  j’1(a-b)(2p  -1). 

Cpi) 


(C.13)

So for a maximum we must have $a = 1$, $b = J'$ or vice versa. The problem is
now simplified to finding the maximum of $|2p_a - 1|$ subject to

$$\mathcal{H}^*(p_a) \triangleq 2\left[p_a \log\frac{p_a}{1-p_a} + (1-p_a)\log\frac{1-p_a}{p_a}\right] \le \xi$$

and $1 > p_a > 0$. Now

$$\mathcal{H}^*(p_a) \ge 2\log e\,[2p_a - 1]^2 \tag{C.14}$$

so

$$\xi \ge 2\log e\,[2p_a - 1]^2$$

or

$$|2p_a - 1| \le \xi^{1/2}(2\log e)^{-1/2}. \tag{C.15}$$
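The inequality (C.14) can be checked numerically on a grid of $p_a$ values. The sketch assumes base-2 logarithms, so that $\log e = \log_2 e$; the inequality does not depend on the base as long as both sides use the same one. It is a spot check only, and the function names are hypothetical.

```python
import math


def symmetric_divergence(p):
    """2[p log(p/(1-p)) + (1-p) log((1-p)/p)] with base-2 logarithms."""
    return 2.0 * (p * math.log2(p / (1 - p)) + (1 - p) * math.log2((1 - p) / p))


def quadratic_lower_bound(p):
    """Right side of (C.14): 2 log(e) (2p - 1)^2 in the same base."""
    return 2.0 * math.log2(math.e) * (2 * p - 1) ** 2


grid = [i / 1000 for i in range(1, 1000)]   # p strictly between 0 and 1
smallest_gap = min(symmetric_divergence(p) - quadratic_lower_bound(p) for p in grid)
print(smallest_gap)
```

Writing $t = 2p_a - 1$, the left side equals $2t\log\frac{1+t}{1-t}$, and $\ln\frac{1+t}{1-t} \ge 2t$ for $t \ge 0$ gives the bound with room to spare; equality occurs only at $p_a = 1/2$.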

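So from (C.13) we have

$$\max_{\{p_i,q_i\}} \sum_{m=1}^{J'} mJ^{-1}(p_m - q_m) \le J'J^{-1}\xi^{1/2}(2\log e)^{-1/2} \le n\bar D\,\xi^{1/2}(2\log e)^{-1/2}$$

which substituted into (C.5) gives

$$|D(f_n;\theta) - D(f_n;\varphi)| \le \bar D\,\xi^{1/2}(2\log e)^{-1/2} + (nJ)^{-1}. \tag{C.16}$$

Since (C.16) holds for all $J$

$$|D(f_n;\theta) - D(f_n;\varphi)| \le \bar D\,\xi^{1/2}(2\log e)^{-1/2} \tag{C.17}$$

as desired.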


Derivation of Eq. (199). Let $\theta = (a_1,\ldots,a_k,\sigma^2)$ and $\varphi = (\hat a_1,\ldots,\hat a_k,\hat\sigma^2)$
and assume $\|\theta - \varphi\| \le \epsilon$. Given initial state $x^0 = (x_{-1},\ldots,x_{-k})$

•1  .  n  PqCjtlS  ) 


3CQ(8;<p)  -  n"1  J*  Q  P0(*jx  )log 


where 


*1  o 

“  j-o  J»k  P8(xl-1 . "j-kls  >  J„  V*jb-i . 


Pfl  |X4«,i>  *  *  *  ,X4»fc) 
lo8  AA/  - - dx.,...,dx,  k, 

Vi1!*1 .  J*k)  J  J 


P0(£|*°)  -  Vq  P0(*jlxj-1 . xj.k> 


The  inner  integral  of  (C.19)  is  the  entropy  of  a  Gaussian  distribution 

o  2 

7l(ny<J  )  relative  to  a  distribution  ,<?  )  where 


We  my  uilly  evaluate  this  Integral  to  got 
3C  (9;<p)  S  H  log  e[<o2  -tf2)2^)'2 


-1  «2 


*  A  J,*k  P8<Xj‘l . *i*)(*i'*i 


-4  ,  -1  -2 


n-1  k  k 


<,h  t  log  eCa.  +n  ff.  Z  I  E  EfttX.  .X  ]) 
1  X  1-0  i-1  B-l  8  J*“ 


Since 


i  o  j 

Xt  "  It  x  +  E  YJ  Z.  ,h 
1  j-0  1  J  1 


and  x  -  0  we  have 


i-l  4  2  2 

-  E  [f J]f  .  <y 

J-0  1,1 


Further 


EtX^j]  S  ®ax(E[X2]  ,E[X2]) 


2  2 

and  from  (C.33)  E0[Xt]  i  E0[Xj]  for  1  a  J,  so 


-J  s  V8> 


If  we  define 


o*<0>  A  urn  E  [X2]  . 

J  -  «  J 


o2  «  max  02(0) 
0  €  A 

ve  have  from  (C.20) 


3^(0;<p)  £  %  «2  log  e[o^+02  k2  02  ]  (C 

as  desired. 

a-1  , 

Derivation  of  (2071 .  Here  we  bound  Z  .  To  do  this  we  first 

1  .  .  k  i-0  . 

bound  where  t  •  s^l  *  Now  *  has  eigenvalue-eigenvector 

decomposition 

Y1.  v"1  E1  V. 

Let  V  »  CvAm}  and  v'1  -  {u^J  and  define 


1''6%UPA““Xtl"J'Wl*11’ 

Since $\Lambda$ is compact and $V$ is invertible for all $\theta \in \Lambda$ (recall $a_k \ne 0$ for
$\theta \in \Lambda$) we have $\eta < \infty$. Let $\lambda$ be the maximum eigenvalue for any $\Psi$
corresponding to a source $\theta \in \Lambda$. To bound this quantity, the worst case is where $\Psi$
has a single eigenvalue equal to $\lambda$. This gives


E  - 


l  0 
1 

l*  * 


ben  we  may  compute  v'^E1,  to  get  [23,  pp.  156] 
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